Abstract. We study the min max optimization problem introduced in [22] for computing policies for batch
mode reinforcement learning in a deterministic setting. First, we show that this problem is NP-hard. In the
two-stage case, we provide two relaxation schemes. The first relaxation scheme works by dropping some constraints in
order to obtain a problem that is solvable in polynomial time. The second relaxation scheme, based on a Lagrangian
relaxation where all constraints are dualized, leads to a conic quadratic programming problem. We also theoretically
prove and empirically illustrate that both relaxation schemes provide better results than those given in [22].
1. Introduction. Research in Reinforcement Learning (RL) [48] aims at designing
computational agents able to learn by themselves how to interact with their environment to
maximize a numerical reward signal. The techniques developed in this field have appealed to
researchers trying to solve sequential decision making problems in many fields such as Finance
[26], Medicine [34, 35] or Engineering [42]. Since the end of the nineties, several researchers
have focused on the resolution of a subproblem of RL: computing a high-performance policy
when the only information available on the environment is contained in a batch collection of
trajectories of the agent [10, 17, 28, 38, 42, 19]. This subfield of RL is known as "batch mode
RL".

Batch mode RL (BMRL) algorithms are challenged when dealing with large or continu- 
ous state spaces. Indeed, in such cases they have to generalize the information contained in 
a generally sparse sample of trajectories. The dominant approach for generalizing this infor- 
mation is to combine BMRL algorithms with function approximators L6,,28j JXilll- Usually, 
these approximators generalize the information contained in the sample to areas poorly
covered by the sample by implicitly assuming that the properties of the system in those areas
are similar to the properties of the system in the nearby areas well covered by the sample. 
This in turn often leads to low performance guarantees on the inferred policy when large state 
space areas are poorly covered by the sample. This can be explained by the fact that when 
computing the performance guarantees of these policies, one needs to take into account that 
they may actually drive the system into the poorly visited areas to which the generalization 
strategy associates a favorable environment behavior, while the environment may actually be 
particularly adversarial in those areas. This is corroborated by theoretical results which show 
that the performance guarantees of the policies inferred by these algorithms degrade with the 
sample dispersion where, loosely speaking, the dispersion can be seen as the radius of the 
largest non-visited state space area. 

To overcome this problem, [22] propose a min max-type strategy for generalizing in
deterministic, Lipschitz continuous environments with continuous state spaces, finite action 
spaces, and finite time-horizon. The min max approach works by determining a sequence 
of actions that maximizes the worst return that could possibly be obtained considering any 
system compatible with the sample of trajectories, and a weak prior knowledge given in the 
form of upper bounds on the Lipschitz constants related to the environment (dynamics, reward 



† Department of Electrical Engineering and Computer Science, University of Liège, Belgium




R. Fonteneau, D. Ernst, B. Boigelot, Q. Louveaux 



function). However, they show that finding an exact solution of the minmax problem is far 
from trivial, even after reformulating the problem so as to avoid the search in the space of 
all compatible functions. To circumvent these difficulties, they propose to replace, inside this 
min max problem, the search for the worst environment given a sequence of actions by an 
expression that lower-bounds the worst possible return, which leads to their so-called CGRL
algorithm (the acronym stands for "Cautious approach to Generalization in Reinforcement
Learning"). This lower bound is derived from their previous work [20, 21] and has a tightness
that depends on the sample dispersion. However, in some configurations where areas of
the state space are not well covered by the sample of trajectories, the CGRL bound turns out to
be very conservative.

In this paper, we propose to further investigate the min max generalization optimization
problem that was initially proposed in [22]. We first show that this optimization problem
is NP-hard. We then focus on the two-stage case, which is still NP-hard. Since it seems
hopeless to exactly solve the problem, we propose two relaxation schemes that preserve the
nature of the min max generalization problem by targeting policies leading to high
performance guarantees. The first relaxation scheme works by dropping some constraints in order
to obtain a problem that is solvable in polynomial time. This results in a well-known
configuration called the trust-region subproblem [13]. The second relaxation scheme, based on a
Lagrangian relaxation where all constraints are dualized, can be solved using conic quadratic
programming in polynomial time. We prove that both relaxation schemes always provide
bounds that are greater than or equal to the CGRL bound. We also show that these bounds are
tight in the sense that they converge towards the actual return when the sample dispersion
converges towards zero, and that the sequences of actions that maximize these bounds converge
towards optimal ones.

The paper is organized as follows: 

• in Section 2, we give a short summary of the literature related to this work,

• Section 3 formalizes the min max generalization problem in a Lipschitz continuous,
deterministic BMRL context,

• in Section 4, we focus on the particular two-stage case, for which we prove that it
can be decoupled into two independent problems corresponding respectively to the
first stage and the second stage (Theorem 4.2):

- the first stage problem leads to a trivial optimization problem that can be solved
in closed form (Corollary 4.3),

- we prove in Section 4.2 that the second stage problem is NP-hard (Corollary
4.7), which consequently proves the NP-hardness of the general min max
generalization problem (Theorem 4.8),

• we then describe in Section 5 the two relaxation schemes that we propose for the
second stage problem:

- the trust-region relaxation scheme (Section 5.1),

- the Lagrangian relaxation scheme (Section 5.2), which is shown to be a
conic quadratic problem (Theorem 5.4),

• we prove in Section 5.3.1 that the first relaxation scheme gives better results than
CGRL,

• we show in Section 5.3.2 that the second relaxation scheme provides better results
than the first relaxation scheme (Theorem 5.13), and consequently better results than
CGRL (Theorem 5.14),

• we analyze in Section 5.4 the asymptotic behavior of the relaxation schemes as a
function of the sample dispersion:

- we show that the bounds provided by the relaxation schemes converge
towards the actual return when the sample dispersion decreases towards zero,

- we show that the sequences of actions maximizing such bounds converge
towards optimal sequences of actions when the sample dispersion decreases
towards zero (Theorem 5.20),

• Section 6 illustrates the relaxation schemes on an academic benchmark,

• Section 7 concludes.

We provide in Figure 1.1 an illustration of the roadmap of the main results of this paper.

Fig. 1.1. Main results of the paper.

2. Related Work. Several works have already been built upon min max paradigms for 
computing policies in a RL setting. In stochastic frameworks, min max approaches are often 
successful for deriving robust solutions with respect to uncertainties in the (parametric)
representation of the probability distributions associated with the environment. In the context
where several agents interact with each other in the same environment, min max approaches
appear to be efficient strategies for designing policies that maximize one agent's reward given
the worst adversarial behavior of the other agents [29, 43]. They have also received some
attention for solving partially observable Markov decision processes [30, 27].

The min max approach towards generalization, originally introduced in [22], implicitly
relies on a methodology for computing lower bounds on the worst possible return
(considering any compatible environment) in a deterministic setting with a mostly unknown actual
environment. In this respect, it is related to other approaches that aim at computing
performance guarantees on the returns of inferred policies [33, 41, 39].

Other fields of research have proposed min max-type strategies for computing control 
policies. This includes Robust Control theory [24] with H∞ methods [2], but also Model
Predictive Control (MPC) theory - where usually the environment is supposed to be fully
known lfT2l [TSl - for which min max approaches have been used to determine an optimal 
sequence of actions with respect to the "worst case" disturbance sequence occurring B4l l4l. 
Finally, there is a broad stream of works in the field of Stochastic Programming Q that 
have addressed the problem of safely planning under uncertainties, mainly known as "robust
stochastic programming" or "risk-averse stochastic programming" ITSl l45l l46l [36l . In this 
field, the two-stage case has also been particularly well-studied [23, 14].



3. Problem Formalization. We first formalize the BMRL setting in Section 3.1 and
we state the min max generalization problem in Section 3.2.



3.1. Batch Mode Reinforcement Learning. We consider a deterministic discrete-time
system whose dynamics over $T$ stages is described by a time-invariant equation

\[ x_{t+1} = f(x_t, u_t), \qquad t = 0, \ldots, T-1, \]

where for all $t$, the state $x_t$ is an element of the state space $\mathcal{X} \subset \mathbb{R}^d$, where $\mathbb{R}^d$ denotes
the $d$-dimensional Euclidean space, and $u_t$ is an element of the finite (discrete) action space
$\mathcal{U} = \{u^{(1)}, \ldots, u^{(m)}\}$ that we abusively identify with $\{1, \ldots, m\}$. $T \in \mathbb{N} \setminus \{0\}$ is referred
to as the (finite) optimization horizon. An instantaneous reward

\[ r_t = \rho(x_t, u_t) \in \mathbb{R} \]

is associated with the action $u_t$ taken while being in state $x_t$. For a given initial state $x_0 \in \mathcal{X}$
and for every sequence of actions $(u_0, \ldots, u_{T-1}) \in \mathcal{U}^T$, the cumulated reward over $T$ stages
(also named $T$-stage return) is defined as follows:

Definition 3.1 ($T$-stage Return).

\[ \forall (u_0, \ldots, u_{T-1}) \in \mathcal{U}^T, \qquad J_T^{(u_0, \ldots, u_{T-1})} = \sum_{t=0}^{T-1} \rho(x_t, u_t), \]

where

\[ x_{t+1} = f(x_t, u_t), \qquad \forall t \in \{0, \ldots, T-1\}. \]

An optimal sequence of actions is a sequence that leads to the maximization of the $T$-stage
return:

Definition 3.2 (Optimal $T$-stage Return).

\[ J_T^{*} = \max_{(u_0, \ldots, u_{T-1}) \in \mathcal{U}^T} J_T^{(u_0, \ldots, u_{T-1})}. \]
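Definitions 3.1 and 3.2 can be illustrated on a system whose dynamics and reward are known. The sketch below is only an illustration: the one-dimensional system (`toy_f`, `toy_rho`) and the helper names are hypothetical and do not come from the paper. It computes the $T$-stage return of a sequence and the optimal return by enumerating all sequences of actions.

```python
from itertools import product


def t_stage_return(f, rho, x0, actions):
    """Cumulated reward of a fixed sequence of actions from x0 (Definition 3.1)."""
    x, total = x0, 0.0
    for u in actions:
        total += rho(x, u)  # instantaneous reward r_t = rho(x_t, u_t)
        x = f(x, u)         # deterministic transition x_{t+1} = f(x_t, u_t)
    return total


def optimal_return(f, rho, x0, action_space, T):
    """Optimal T-stage return by enumerating all |U|^T sequences (Definition 3.2)."""
    return max(
        (t_stage_return(f, rho, x0, seq), seq)
        for seq in product(action_space, repeat=T)
    )


# Hypothetical toy system: action 1 moves right, action 2 moves left,
# and the reward penalizes the distance to the origin.
def toy_f(x, u):
    return x + (1.0 if u == 1 else -1.0)


def toy_rho(x, u):
    return -abs(x)


best_value, best_seq = optimal_return(toy_f, toy_rho, 2.0, (1, 2), 2)
```

In the batch mode setting described next, `f` and `rho` are unknown, so this enumeration cannot be run on the true system; only sampled transitions are available.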

We further make the following assumptions that characterize the batch mode setting:

1. The system dynamics $f$ and the reward function $\rho$ are unknown;

2. For each action $u \in \mathcal{U}$, a set of $n^{(u)} \in \mathbb{N}$ one-step system transitions

\[ \mathcal{F}^{(u)} = \left\{ \left( x^{(u),k}, r^{(u),k}, y^{(u),k} \right) \right\}_{k=1}^{n^{(u)}} \]

is known where each one-step transition is such that:

\[ y^{(u),k} = f\left(x^{(u),k}, u\right) \quad \text{and} \quad r^{(u),k} = \rho\left(x^{(u),k}, u\right). \]

3. We assume that every set $\mathcal{F}^{(u)}$ contains at least one element:

\[ \forall u \in \mathcal{U}, \qquad n^{(u)} > 0. \]

In the following, we denote by $\mathcal{F}$ the collection of all system transitions:

\[ \mathcal{F} = \mathcal{F}^{(1)} \cup \ldots \cup \mathcal{F}^{(m)}. \]

Under those assumptions, batch mode reinforcement learning (BMRL) techniques propose
to infer from the sample of one-step system transitions $\mathcal{F}$ a high-performance sequence of
actions, i.e. a sequence of actions $(\tilde{u}_0^*, \ldots, \tilde{u}_{T-1}^*) \in \mathcal{U}^T$ such that $J_T^{(\tilde{u}_0^*, \ldots, \tilde{u}_{T-1}^*)}$ is as close
as possible to $J_T^*$.

3.2. Min max Generalization under Lipschitz Continuity Assumptions. In this
section, we state the min max generalization problem that we study in this paper. The
formalization was originally proposed in [22].

We first assume that the system dynamics $f$ and the reward function $\rho$ are
Lipschitz continuous. There exist finite constants $L_f, L_\rho \in \mathbb{R}$ such that:

\[ \forall (x, x') \in \mathcal{X}^2, \ \forall u \in \mathcal{U}, \qquad \| f(x,u) - f(x',u) \| \le L_f \| x - x' \|, \]

\[ | \rho(x,u) - \rho(x',u) | \le L_\rho \| x - x' \|, \]

where $\| \cdot \|$ denotes the Euclidean norm over the space $\mathcal{X}$. We also assume that two constants
$L_f$ and $L_\rho$ satisfying the above-written inequalities are known.

For a given sequence of actions, one can define the worst possible return that can be
obtained by any system whose dynamics $f'$ and $\rho'$ would satisfy the Lipschitz inequalities and
that would coincide with the values of the functions $f$ and $\rho$ given by the sample of system
transitions $\mathcal{F}$. As shown in [22], this worst possible return can be computed by solving a
finite-dimensional optimization problem over $\mathcal{X}^{T-1} \times \mathbb{R}^T$. Intuitively, solving such an
optimization problem amounts to determining a most pessimistic trajectory of the system that
is still compliant with the sample of data and the Lipschitz continuity assumptions. More
specifically, for a given sequence of actions $(u_0, \ldots, u_{T-1}) \in \mathcal{U}^T$, some given constants $L_f$
and $L_\rho$, a given initial state $x_0 \in \mathcal{X}$ and a given sample of transitions $\mathcal{F}$, this optimization
problem writes:



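\[
(\mathcal{P}_T(\mathcal{F}, L_f, L_\rho, x_0, u_0, \ldots, u_{T-1})): \qquad
\min_{\substack{\hat{\mathbf{r}}_0, \ldots, \hat{\mathbf{r}}_{T-1} \in \mathbb{R} \\ \hat{\mathbf{x}}_0, \ldots, \hat{\mathbf{x}}_{T-1} \in \mathcal{X}}} \ \sum_{t=0}^{T-1} \hat{\mathbf{r}}_t
\]

subject to

\[ \left| \hat{\mathbf{r}}_t - r^{(u_t),k_t} \right|^2 \le L_\rho^2 \left\| \hat{\mathbf{x}}_t - x^{(u_t),k_t} \right\|^2, \qquad \forall (t, k_t) \in \{0, \ldots, T-1\} \times \{1, \ldots, n^{(u_t)}\}, \]

\[ \left\| \hat{\mathbf{x}}_{t+1} - y^{(u_t),k_t} \right\|^2 \le L_f^2 \left\| \hat{\mathbf{x}}_t - x^{(u_t),k_t} \right\|^2, \qquad \forall (t, k_t) \in \{0, \ldots, T-1\} \times \{1, \ldots, n^{(u_t)}\}, \]

\[ \left| \hat{\mathbf{r}}_t - \hat{\mathbf{r}}_{t'} \right|^2 \le L_\rho^2 \left\| \hat{\mathbf{x}}_t - \hat{\mathbf{x}}_{t'} \right\|^2, \qquad \forall t, t' \in \{0, \ldots, T-1 \mid u_t = u_{t'}\}, \]

\[ \left\| \hat{\mathbf{x}}_{t+1} - \hat{\mathbf{x}}_{t'+1} \right\|^2 \le L_f^2 \left\| \hat{\mathbf{x}}_t - \hat{\mathbf{x}}_{t'} \right\|^2, \qquad \forall t, t' \in \{0, \ldots, T-2 \mid u_t = u_{t'}\}, \]

\[ \hat{\mathbf{x}}_0 = x_0. \]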



Note that, throughout the paper, optimization variables will be written in bold. 

The min max approach to generalization aims at identifying which sequence of actions
maximizes its worst possible return, that is which sequence of actions leads to the highest
value of $(\mathcal{P}_T(\mathcal{F}, L_f, L_\rho, x_0, u_0, \ldots, u_{T-1}))$.

We focus in this paper on the design of resolution schemes for solving the program
$(\mathcal{P}_T(\mathcal{F}, L_f, L_\rho, x_0, u_0, \ldots, u_{T-1}))$. These schemes can afterwards be used for solving the
min max problem through exhaustive search over the set of all sequences of actions.
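The exhaustive search mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions: `lower_bound` stands for any routine that returns a lower bound on the worst possible return of a sequence of actions (for instance one of the relaxation schemes of Section 5), and the table `toy_bounds` is a made-up placeholder, not a result from the paper.

```python
from itertools import product


def minmax_sequence(lower_bound, action_space, T):
    """Return (best_bound, best_sequence) over all |U|^T sequences of actions."""
    best = None
    for seq in product(action_space, repeat=T):
        value = lower_bound(seq)  # bound on the worst possible return of seq
        if best is None or value > best[0]:
            best = (value, seq)
    return best


# Hypothetical bounds for a two-stage problem with two actions.
toy_bounds = {(1, 1): -0.5, (1, 2): 0.2, (2, 1): -1.0, (2, 2): 0.1}
best_value, best_seq = minmax_sequence(toy_bounds.__getitem__, (1, 2), 2)
```

The search visits $|\mathcal{U}|^T$ sequences, so it is only practical for short horizons such as the two-stage case studied below.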

Later in this paper, we will also analyze the computational complexity of this min max
generalization problem. When carrying out this analysis, we will assume that all the data of
the problem (i.e., $T, \mathcal{F}, L_f, L_\rho, x_0, u_0, \ldots, u_{T-1}$) are given in the form of rational numbers.

4. The Two-stage Case. In this section, we restrict ourselves to the case where the
time horizon contains only two steps, i.e. $T = 2$, which is an important particular case
of $(\mathcal{P}_T(\mathcal{F}, L_f, L_\rho, x_0, u_0, \ldots, u_{T-1}))$. Many works in optimal sequential decision making
have considered the two-stage case [23, 14], which relates to many applications, such as for
instance medical applications where one wants to infer "safe" clinical decision rules from
batch collections of clinical data ifTl 1311 [32l l49l . 



In Section 4.1, we show that this problem can be decoupled into two subproblems. While
the first subproblem is straightforward to solve, we prove in Section 4.2 that the second one
is NP-hard, which proves that the two-stage problem as well as the generalized $T$-stage
problem $(\mathcal{P}_T(\mathcal{F}, L_f, L_\rho, x_0, u_0, \ldots, u_{T-1}))$ are also NP-hard.

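Given a two-stage sequence of actions $(u_0, u_1) \in \mathcal{U}^2$, the two-stage version of the
problem $(\mathcal{P}_T(\mathcal{F}, L_f, L_\rho, x_0, u_0, \ldots, u_{T-1}))$ writes as follows:

\[
(\mathcal{P}_2(\mathcal{F}, L_f, L_\rho, x_0, u_0, u_1)): \qquad
\min_{\substack{\hat{\mathbf{r}}_0, \hat{\mathbf{r}}_1 \in \mathbb{R} \\ \hat{\mathbf{x}}_0, \hat{\mathbf{x}}_1 \in \mathcal{X}}} \ \hat{\mathbf{r}}_0 + \hat{\mathbf{r}}_1
\]

subject to

\[ \left| \hat{\mathbf{r}}_0 - r^{(u_0),k_0} \right|^2 \le L_\rho^2 \left\| \hat{\mathbf{x}}_0 - x^{(u_0),k_0} \right\|^2, \qquad \forall k_0 \in \{1, \ldots, n^{(u_0)}\}, \tag{4.1} \]

\[ \left| \hat{\mathbf{r}}_1 - r^{(u_1),k_1} \right|^2 \le L_\rho^2 \left\| \hat{\mathbf{x}}_1 - x^{(u_1),k_1} \right\|^2, \qquad \forall k_1 \in \{1, \ldots, n^{(u_1)}\}, \tag{4.2} \]

\[ \left\| \hat{\mathbf{x}}_1 - y^{(u_0),k_0} \right\|^2 \le L_f^2 \left\| \hat{\mathbf{x}}_0 - x^{(u_0),k_0} \right\|^2, \qquad \forall k_0 \in \{1, \ldots, n^{(u_0)}\}, \tag{4.3} \]

\[ \left| \hat{\mathbf{r}}_0 - \hat{\mathbf{r}}_1 \right|^2 \le L_\rho^2 \left\| \hat{\mathbf{x}}_0 - \hat{\mathbf{x}}_1 \right\|^2 \quad \text{if } u_0 = u_1, \tag{4.4} \]

\[ \hat{\mathbf{x}}_0 = x_0. \tag{4.5} \]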



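For the sake of simplicity, we will often drop the arguments in the definition of the
optimization problem and refer to $(\mathcal{P}_2(\mathcal{F}, L_f, L_\rho, x_0, u_0, u_1))$ as $(\mathcal{P}_2^{(u_0,u_1)})$. We denote by
$B_2^{(u_0,u_1)}(\mathcal{F})$ the lower bound associated with an optimal solution of $(\mathcal{P}_2^{(u_0,u_1)})$:

Definition 4.1 (Optimal Value $B_2^{(u_0,u_1)}(\mathcal{F})$). Let $(u_0, u_1) \in \mathcal{U}^2$, and let $(\hat{\mathbf{r}}_0^*, \hat{\mathbf{r}}_1^*, \hat{\mathbf{x}}_0^*, \hat{\mathbf{x}}_1^*)$
be an optimal solution to $(\mathcal{P}_2^{(u_0,u_1)})$. Then,

\[ B_2^{(u_0,u_1)}(\mathcal{F}) = \hat{\mathbf{r}}_0^* + \hat{\mathbf{r}}_1^*. \]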



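4.1. Decoupling Stages. Let $(\mathcal{P}_2^{\prime (u_0,u_1)})$ and $(\mathcal{P}_2^{\prime\prime (u_0,u_1)})$ be the two following
subproblems:

\[
(\mathcal{P}_2^{\prime (u_0,u_1)}): \qquad \min_{\substack{\hat{\mathbf{r}}_0 \in \mathbb{R} \\ \hat{\mathbf{x}}_0 \in \mathcal{X}}} \ \hat{\mathbf{r}}_0
\]

subject to

\[ \left| \hat{\mathbf{r}}_0 - r^{(u_0),k_0} \right|^2 \le L_\rho^2 \left\| \hat{\mathbf{x}}_0 - x^{(u_0),k_0} \right\|^2, \qquad \forall k_0 \in \{1, \ldots, n^{(u_0)}\}, \tag{4.6} \]

\[ \hat{\mathbf{x}}_0 = x_0. \]

\[
(\mathcal{P}_2^{\prime\prime (u_0,u_1)}): \qquad \min_{\substack{\hat{\mathbf{r}}_1 \in \mathbb{R} \\ \hat{\mathbf{x}}_1 \in \mathcal{X}}} \ \hat{\mathbf{r}}_1
\]

subject to

\[ \left| \hat{\mathbf{r}}_1 - r^{(u_1),k_1} \right|^2 \le L_\rho^2 \left\| \hat{\mathbf{x}}_1 - x^{(u_1),k_1} \right\|^2, \qquad \forall k_1 \in \{1, \ldots, n^{(u_1)}\}, \tag{4.7} \]

\[ \left\| \hat{\mathbf{x}}_1 - y^{(u_0),k_0} \right\|^2 \le L_f^2 \left\| x_0 - x^{(u_0),k_0} \right\|^2, \qquad \forall k_0 \in \{1, \ldots, n^{(u_0)}\}. \tag{4.8} \]

We show in this section that an optimal solution to $(\mathcal{P}_2^{(u_0,u_1)})$ can be obtained by solving
the two subproblems $(\mathcal{P}_2^{\prime (u_0,u_1)})$ and $(\mathcal{P}_2^{\prime\prime (u_0,u_1)})$ corresponding to the first stage and the
second stage. Indeed, one can see that the stages $t = 0$ and $t = 1$ are theoretically coupled
by constraint (4.4), except in the case where the two actions $u_0$ and $u_1$ are different, for which
$(\mathcal{P}_2^{(u_0,u_1)})$ is trivially decoupled. We prove in the following that, even in the case $u_0 = u_1$,
optimal solutions to the two decoupled problems $(\mathcal{P}_2^{\prime (u_0,u_1)})$ and $(\mathcal{P}_2^{\prime\prime (u_0,u_1)})$ also satisfy
constraint (4.4). Additionally, we provide the solution of $(\mathcal{P}_2^{\prime (u_0,u_1)})$.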



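Theorem 4.2. Let $(u_0, u_1) \in \mathcal{U}^2$. If $(\hat{\mathbf{r}}_0^*, \hat{\mathbf{x}}_0^*)$ is an optimal solution to $(\mathcal{P}_2^{\prime (u_0,u_1)})$ and
$(\hat{\mathbf{r}}_1^*, \hat{\mathbf{x}}_1^*)$ is an optimal solution to $(\mathcal{P}_2^{\prime\prime (u_0,u_1)})$, then $(\hat{\mathbf{r}}_0^*, \hat{\mathbf{r}}_1^*, \hat{\mathbf{x}}_0^*, \hat{\mathbf{x}}_1^*)$ is an optimal solution to
$(\mathcal{P}_2^{(u_0,u_1)})$.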
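Proof.

• First case: $u_0 \neq u_1$.

The constraint (4.4) drops and the theorem is trivial.

• Second case: $u_0 = u_1$.

The rationale of the proof is the following. We first relax constraint (4.4), and consider the two
problems $(\mathcal{P}_2^{\prime (u_0,u_1)})$ and $(\mathcal{P}_2^{\prime\prime (u_0,u_1)})$. Then, we show that optimal solutions of $(\mathcal{P}_2^{\prime (u_0,u_1)})$
and $(\mathcal{P}_2^{\prime\prime (u_0,u_1)})$ also satisfy constraint (4.4).

About $(\mathcal{P}_2^{\prime (u_0,u_1)})$. The problem $(\mathcal{P}_2^{\prime (u_0,u_1)})$ consists in the minimization of $\hat{\mathbf{r}}_0$ under
the intersection of interval constraints. It is therefore straightforward to solve. In particular
the optimal solution $\hat{\mathbf{r}}_0^*$ lies at the lower value of one of the intervals. Therefore there exists
$(x^{(u_0),k_0^*}, r^{(u_0),k_0^*}, y^{(u_0),k_0^*}) \in \mathcal{F}^{(u_0)}$ such that

\[ \hat{\mathbf{r}}_0^* = r^{(u_0),k_0^*} - L_\rho \left\| x_0 - x^{(u_0),k_0^*} \right\|. \tag{4.9} \]

Furthermore $\hat{\mathbf{r}}_0^*$ must belong to all intervals. We therefore have that

\[ \hat{\mathbf{r}}_0^* \ge r^{(u_0),k_0} - L_\rho \left\| x_0 - x^{(u_0),k_0} \right\|, \qquad \forall k_0 \in \{1, \ldots, n^{(u_0)}\}. \tag{4.10} \]

In other words,

\[ \hat{\mathbf{r}}_0^* = \max_{k_0 \in \{1, \ldots, n^{(u_0)}\}} r^{(u_0),k_0} - L_\rho \left\| x_0 - x^{(u_0),k_0} \right\|. \]

About $(\mathcal{P}_2^{\prime\prime (u_0,u_1)})$. Again we observe that it is the minimization of $\hat{\mathbf{r}}_1$ under the
intersection of interval constraints as well. The sizes of the intervals are however not fixed but
determined by the variable $\hat{\mathbf{x}}_1$. If we denote the optimal solution of $(\mathcal{P}_2^{\prime\prime (u_0,u_1)})$ by $\hat{\mathbf{r}}_1^*$ and
$\hat{\mathbf{x}}_1^*$, we know that $\hat{\mathbf{r}}_1^*$ also lies at the lower value of one of the intervals. Hence there exists
$(x^{(u),k_1^*}, r^{(u),k_1^*}, y^{(u),k_1^*}) \in \mathcal{F}^{(u)}$, with $u = u_0 = u_1$, such that

\[ \hat{\mathbf{r}}_1^* = r^{(u),k_1^*} - L_\rho \left\| \hat{\mathbf{x}}_1^* - x^{(u),k_1^*} \right\|. \tag{4.11} \]

Furthermore $\hat{\mathbf{r}}_1^*$ must belong to all intervals. We therefore have that

\[ \hat{\mathbf{r}}_1^* \ge r^{(u),k_1} - L_\rho \left\| \hat{\mathbf{x}}_1^* - x^{(u),k_1} \right\|, \qquad \forall k_1 \in \{1, \ldots, n^{(u)}\}. \tag{4.12} \]

We now discuss two cases depending on the sign of $\hat{\mathbf{r}}_0^* - \hat{\mathbf{r}}_1^*$.

- If $\hat{\mathbf{r}}_0^* - \hat{\mathbf{r}}_1^* \ge 0$.
Using (4.9) and (4.12) with index $k_0^*$, we have

\[ \hat{\mathbf{r}}_0^* - \hat{\mathbf{r}}_1^* \le L_\rho \left( \left\| \hat{\mathbf{x}}_1^* - x^{(u),k_0^*} \right\| - \left\| x_0 - x^{(u),k_0^*} \right\| \right). \tag{4.13} \]

Since $\hat{\mathbf{r}}_0^* - \hat{\mathbf{r}}_1^* \ge 0$, we therefore have

\[ \left| \hat{\mathbf{r}}_0^* - \hat{\mathbf{r}}_1^* \right| \le L_\rho \left( \left\| \hat{\mathbf{x}}_1^* - x^{(u),k_0^*} \right\| - \left\| x_0 - x^{(u),k_0^*} \right\| \right). \tag{4.14} \]

Using the triangle inequality we can write

\[ \left\| \hat{\mathbf{x}}_1^* - x^{(u),k_0^*} \right\| \le \left\| \hat{\mathbf{x}}_1^* - x_0 \right\| + \left\| x_0 - x^{(u),k_0^*} \right\|. \tag{4.15} \]

Replacing (4.15) in (4.14) we obtain

\[ \left| \hat{\mathbf{r}}_0^* - \hat{\mathbf{r}}_1^* \right| \le L_\rho \left\| \hat{\mathbf{x}}_1^* - x_0 \right\|, \]

which shows that $\hat{\mathbf{r}}_0^*$ and $\hat{\mathbf{r}}_1^*$ satisfy constraint (4.4).

- If $\hat{\mathbf{r}}_0^* - \hat{\mathbf{r}}_1^* < 0$.
Using (4.11) and (4.10) with index $k_1^*$, we have

\[ \hat{\mathbf{r}}_1^* - \hat{\mathbf{r}}_0^* \le L_\rho \left( \left\| x_0 - x^{(u),k_1^*} \right\| - \left\| \hat{\mathbf{x}}_1^* - x^{(u),k_1^*} \right\| \right) \]

and since $\hat{\mathbf{r}}_0^* - \hat{\mathbf{r}}_1^* < 0$,

\[ \left| \hat{\mathbf{r}}_0^* - \hat{\mathbf{r}}_1^* \right| \le L_\rho \left( \left\| x_0 - x^{(u),k_1^*} \right\| - \left\| \hat{\mathbf{x}}_1^* - x^{(u),k_1^*} \right\| \right). \tag{4.16} \]

Using the triangle inequality we can write

\[ \left\| x_0 - x^{(u),k_1^*} \right\| \le \left\| x_0 - \hat{\mathbf{x}}_1^* \right\| + \left\| \hat{\mathbf{x}}_1^* - x^{(u),k_1^*} \right\|. \tag{4.17} \]

Replacing (4.17) in (4.16) yields

\[ \left| \hat{\mathbf{r}}_0^* - \hat{\mathbf{r}}_1^* \right| \le L_\rho \left\| x_0 - \hat{\mathbf{x}}_1^* \right\|, \]

which again shows that $\hat{\mathbf{r}}_0^*$ and $\hat{\mathbf{r}}_1^*$ satisfy constraint (4.4).
In both cases $\hat{\mathbf{r}}_0^* - \hat{\mathbf{r}}_1^* \ge 0$ and $\hat{\mathbf{r}}_0^* - \hat{\mathbf{r}}_1^* < 0$, we have shown that constraint (4.4) is satisfied. □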



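In the remainder of the paper, we focus on the two subproblems $(\mathcal{P}_2^{\prime (u_0,u_1)})$ and $(\mathcal{P}_2^{\prime\prime (u_0,u_1)})$
rather than on $(\mathcal{P}_2^{(u_0,u_1)})$. From the proof of Theorem 4.2 given above, we can directly obtain
the solution of $(\mathcal{P}_2^{\prime (u_0,u_1)})$:

Corollary 4.3. The solution of the problem $(\mathcal{P}_2^{\prime (u_0,u_1)})$ is

\[ \hat{\mathbf{r}}_0^* = \max_{k_0 \in \{1, \ldots, n^{(u_0)}\}} r^{(u_0),k_0} - L_\rho \left\| x_0 - x^{(u_0),k_0} \right\|. \]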
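The closed form of Corollary 4.3 is a maximum over the transitions sampled for the first action. The sketch below is an illustration only: the function name, the tuple layout `(x, r, y)` and the numerical values are assumptions made for the example.

```python
import math


def first_stage_bound(x0, transitions_u0, L_rho):
    """Closed-form solution of the first-stage problem (Corollary 4.3).

    transitions_u0 is a list of tuples (x, r, y) sampled with action u0,
    where x and y are coordinate tuples and r is the observed reward.
    """
    return max(r - L_rho * math.dist(x0, x) for (x, r, _y) in transitions_u0)


# Hypothetical sample with two transitions in a two-dimensional state space.
sample_u0 = [((0.0, 0.0), 1.0, (0.5, 0.0)), ((2.0, 0.0), 3.0, (2.5, 0.0))]
bound = first_stage_bound((1.0, 0.0), sample_u0, L_rho=1.0)
```

Each transition contributes the lower end of its interval, and the intersection of the intervals starts at the largest of these lower ends.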



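4.2. Complexity of $(\mathcal{P}_2^{\prime\prime (u_0,u_1)})$. The problem $(\mathcal{P}_2^{\prime (u_0,u_1)})$ being solved, we now
focus in this section on the resolution of $(\mathcal{P}_2^{\prime\prime (u_0,u_1)})$. In particular, we show that it is
NP-hard, even in the particular case where there is only one element in the sample $\mathcal{F}^{(u_1)} =
\{(x^{(u_1),1}, r^{(u_1),1}, y^{(u_1),1})\}$. In this particular case, the problem $(\mathcal{P}_2^{\prime\prime (u_0,u_1)})$ amounts to
maximizing the distance $\| \hat{\mathbf{x}}_1 - x^{(u_1),1} \|$ under an intersection of balls as we show in the
following lemma.

Lemma 4.4. If the cardinality of $\mathcal{F}^{(u_1)}$ is equal to 1:

\[ \mathcal{F}^{(u_1)} = \left\{ \left( x^{(u_1),1}, r^{(u_1),1}, y^{(u_1),1} \right) \right\}, \]

then the optimal solution to $(\mathcal{P}_2^{\prime\prime (u_0,u_1)})$ satisfies

\[ \hat{\mathbf{r}}_1^* = r^{(u_1),1} - L_\rho \left\| \hat{\mathbf{x}}_1^* - x^{(u_1),1} \right\|, \]

where $\hat{\mathbf{x}}_1^*$ maximizes $\| \hat{\mathbf{x}}_1 - x^{(u_1),1} \|$ subject to

\[ \left\| \hat{\mathbf{x}}_1 - y^{(u_0),k_0} \right\|^2 \le L_f^2 \left\| x_0 - x^{(u_0),k_0} \right\|^2, \qquad \forall \left( x^{(u_0),k_0}, r^{(u_0),k_0}, y^{(u_0),k_0} \right) \in \mathcal{F}^{(u_0)}. \]

Proof. The unique constraint concerning $\hat{\mathbf{r}}_1$ is an interval. Therefore $\hat{\mathbf{r}}_1^*$ takes the value of
the lower bound of the interval. In order to obtain the lowest such value, the right-hand side
of (4.7) must be maximized under the other constraints. □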

Note that if the cardinality $n^{(u_0)}$ of $\mathcal{F}^{(u_0)}$ is also equal to 1, then $(\mathcal{P}_2^{(u_0,u_1)})$ can be solved
exactly, as we will later show in Corollary 5.3. But, in the general case where $n^{(u_0)} > 1$, this
problem of maximizing a distance under a set of ball-constraints is NP-hard as we now prove.
To do it, we introduce the MNBC (for "Max Norm with Ball Constraints") decision problem:

Definition 4.5 (MNBC Decision Problem). Given $x^{(0)} \in \mathbb{Q}^d$, $y^i \in \mathbb{Q}^d$, $\gamma_i \in \mathbb{Q}$, $i \in
\{1, \ldots, I\}$, $C \in \mathbb{Q}$, the MNBC problem is to determine whether there exists $x \in \mathbb{R}^d$ such that

\[ \left\| x - x^{(0)} \right\|^2 \ge C \]

and

\[ \left\| x - y^i \right\|^2 \le \gamma_i, \qquad \forall i \in \{1, \ldots, I\}. \]



Lemma 4.6. MNBC is NP-hard. 

Proof. To prove it, we will do a reduction from the $\{0, 1\}$-programming feasibility
problem [40]. More precisely, we consider in this proof the $\{0, 2\}$-programming feasibility
problem, which is equivalent. The problem is, given $p \in \mathbb{N}$, $A \in \mathbb{Z}^{p \times d}$, $b \in \mathbb{Z}^p$, to find
whether there exists $x \in \{0, 2\}^d$ that satisfies $Ax \le b$. This problem is known to be NP-hard
and we now provide a polynomial reduction to MNBC.

The dimension $d$ is kept the same in both problems. The first step is to define a set of
constraints for MNBC such that the only potential feasible solutions are exactly $x \in \{0, 2\}^d$.
We define

\[ x^{(0)} = (1, \ldots, 1) \]

and

\[ C = d. \]

For $i = 1, \ldots, d$, we define

\[ y^i = (y_1^i, \ldots, y_d^i) \]

with $y_i^i = 0$ and $y_j^i = 1$ for all $j \neq i$ and $\gamma_i = d + 3$.
Similarly for $i = 1, \ldots, d$, we define

\[ y^{d+i} = (y_1^{d+i}, \ldots, y_d^{d+i}) \]

with $y_i^{d+i} = 2$ and $y_j^{d+i} = 1$ for all $j \neq i$ and $\gamma_{d+i} = d + 3$.

Claim:

\[ \left\{ x \in \mathbb{R}^d \ \middle| \ \left\| x - x^{(0)} \right\|^2 \ge C \right\} \cap \bigcap_{i=1}^{2d} \left\{ x \in \mathbb{R}^d \ \middle| \ \left\| x - y^i \right\|^2 \le \gamma_i \right\} = \{0, 2\}^d. \]

It is readily verified that any $x \in \{0, 2\}^d$ belongs to the $2d + 1$ above sets.

Consider $x \in \mathbb{R}^d$ that belongs to the $2d + 1$ above sets. Consider an index $k \in \{1, \ldots, d\}$.
Using the constraints defining the sets, we can in particular write

\[ \| (x_1, \ldots, x_{k-1}, x_k, x_{k+1}, \ldots, x_d) - (1, \ldots, 1, 1, 1, \ldots, 1) \|^2 \ge d, \]
\[ \| (x_1, \ldots, x_{k-1}, x_k, x_{k+1}, \ldots, x_d) - (1, \ldots, 1, 0, 1, \ldots, 1) \|^2 \le d + 3, \]
\[ \| (x_1, \ldots, x_{k-1}, x_k, x_{k+1}, \ldots, x_d) - (1, \ldots, 1, 2, 1, \ldots, 1) \|^2 \le d + 3, \]

that we can write algebraically

\[ \sum_{j \neq k} (x_j - 1)^2 + (x_k - 1)^2 \ge d, \tag{4.18} \]
\[ \sum_{j \neq k} (x_j - 1)^2 + x_k^2 \le d + 3, \tag{4.19} \]
\[ \sum_{j \neq k} (x_j - 1)^2 + (x_k - 2)^2 \le d + 3. \tag{4.20} \]

By computing (4.19) - (4.18) and (4.20) - (4.18), we obtain $x_k \le 2$ and $x_k \ge 0$ respectively.
This implies that

\[ \sum_{k=1}^{d} (x_k - 1)^2 \le d \]

and the equality is obtained if and only if we have that $x_k \in \{0, 2\}$ for all $k$, which proves the
claim.
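The claim can be checked numerically in small dimension. The following sketch is not part of the proof and is only an illustration: it enumerates a coarse grid of candidate points for $d = 2$ (the grid and its step of 0.5 are arbitrary choices made for the example) and keeps those satisfying the $2d + 1$ constraints defined above.

```python
from itertools import product


def feasible(x):
    """Check the 2d + 1 constraints used in the reduction for a point x."""
    d = len(x)
    # Constraint ||x - (1, ..., 1)||^2 >= d.
    if sum((xj - 1.0) ** 2 for xj in x) < d:
        return False
    # Ball constraints ||x - y^i||^2 <= d + 3, where y^i equals (1, ..., 1)
    # except for its i-th coordinate, which is 0 (first d balls) or 2 (last d).
    for i in range(d):
        for centre_i in (0.0, 2.0):
            sq_dist = sum(
                (xj - (centre_i if j == i else 1.0)) ** 2
                for j, xj in enumerate(x)
            )
            if sq_dist > d + 3:
                return False
    return True


d = 2
grid = [k * 0.5 for k in range(-2, 7)]  # candidate coordinates -1.0, ..., 3.0
solutions = {x for x in product(grid, repeat=d) if feasible(x)}
expected = set(product((0.0, 2.0), repeat=d))
```

On this grid the feasible points are exactly the four vertices of $\{0, 2\}^2$, in agreement with the claim; a grid check of course says nothing about points off the grid, which are covered by the algebraic argument.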

It remains to prove that we can encode any linear inequality through a ball constraint.
Consider an inequality of the type $\sum_{j=1}^{d} a_j x_j \le b$. We assume that $a \neq 0$ and that $b$ is even and
therefore that there exists no $x \in \{0, 2\}^d$ such that $a^T x = b + 1$. We want to show that there
exist $y \in \mathbb{Q}^d$ and $\gamma \in \mathbb{Q}$ such that

\[ \left\{ x \in \{0, 2\}^d \ \middle| \ a^T x \le b \right\} = \left\{ x \in \{0, 2\}^d \ \middle| \ \| x - y \|^2 \le \gamma \right\}. \tag{4.21} \]



Let y GR^he the intersection point of the hyperplane a^x = b+1 and the line (1 • • • 1)^ 
A(ai • • • Ud)^ , A e M. Let r be defined as follows: 



We claim that choosing 7 = and y ^ y ~ ra allows us to obtain ( |4.21| i. To prove it, 
we need to show that x e {0, 2}'' belongs to the ball if and only if it satisfies the constraint 
a^x < b. Let x e {0, 2}"^. There are two cases to consider: 

• Suppose first that a^x > 6 + 2. 
Since y is the closest point to y that satisfies a^y = 6 + 1, it also implies that any point x 
such that a^x > 6 + 1 is such that — y|p > proving that: 



x(jt {xeR'^ \ \\x-yf < r^} . 
Suppose now that a^x < 6 and in particular that a^x 



k with k E N (see 



Figure 4. 1 



Let y G M"* be the intersection point of the hyperplane a^x = 6 — fc and the line (1 • • • 1)^ + 
A(ai • • • fld)-^, A G M. Since ((1 • • • 1)"^, y, x) form a right triangle with the right angle in y 



and since ||(1 • • • 1) 



P < d, we have 



\y-x\\ 



< d. 



(4.22) 



By definition of y, we have: 
Fig. 4.1. The case when $a^T x \le b$.



1 



and by definition of y and y, we have: 

\\v-y\\ > 

Since y, y and y belong to the same line, we have 

h -y\\<r- 



As (y, y, x) form a right triangle with the right angle in y, we have that 



(4.23) 



< r 



d using ( |422l i, ( |4!23] l 



2r 



E -=1 a| E,=i 



2 



Since by definition, r > ^ y E^=i '^'j + 1' '^^ ^™ write 

2 1 



This proves that the chosen ball jxelR'' | — y|p <r^} includes the same points from 
{0, 2}'* as the linear inequality a^x < b. 
The encoding length of all data is furthermore polynomial in the encoding length of the
initial inequalities. This completes the reduction and proves the NP-hardness of MNBC. □

Note that the NP-hardness of MNBC is independent from the choice of the norm used 
over the state space X. The two results follow: 

Corollary 4.7. $(\mathcal{P}_2^{\prime\prime (u_0,u_1)})$ is NP-hard.

Theorem 4.8. The two-stage problem $(\mathcal{P}_2^{(u_0,u_1)})$ and the generalized $T$-stage
problem $(\mathcal{P}_T(\mathcal{F}, L_f, L_\rho, x_0, u_0, \ldots, u_{T-1}))$ are NP-hard.



5. Relaxation Schemes for the Two-stage Case. The two-stage case with only one
element in the set $\mathcal{F}^{(u_1)}$ was proven to be NP-hard in the previous section (except if the
cardinality $n^{(u_0)}$ of $\mathcal{F}^{(u_0)}$ is also equal to 1; in this case $(\mathcal{P}_2^{(u_0,u_1)})$ is solvable in polynomial
time as we will see later in Corollary 5.3). It is therefore unlikely that one can design an
algorithm that optimally solves the general two-stage case in polynomial time (unless P = 
NP). The aim of the min max optimization problem is to obtain a sequence of actions that 
has a performance guarantee. Therefore solving the optimization problem approximately or 
obtaining an upper bound would be irrelevant. Instead we want to propose some relaxation 
schemes that are computationally more tractable, and that are still leading to lower bounds on 
the actual return of the sequences of actions. 

The first relaxation scheme works by dropping some constraints in order to obtain a
problem that is solvable in polynomial time. We show that this scheme provides bounds that
are greater than or equal to the CGRL bound introduced in [22]. The second relaxation scheme is
based on a Lagrangian relaxation where all constraints are dualized. Solving the Lagrangian
dual is shown to be a conic quadratic problem that can be solved in polynomial time using
interior-point methods. We also prove that this relaxation scheme always gives better bounds
than the first relaxation scheme mentioned above, and consequently, better bounds than [22].
We also prove that the bounds computed from these relaxation schemes converge towards the
actual return of the sequence $(u_0, u_1)$ when the sample dispersion converges towards zero.
As a consequence, the sequences of actions that maximize those bounds also become optimal
when the dispersion decreases towards zero.

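From the previous section, we know that the two-stage problem $(\mathcal{P}_2^{(u_0,u_1)})$ can be
decoupled into two subproblems $(\mathcal{P}_2^{\prime (u_0,u_1)})$ and $(\mathcal{P}_2^{\prime\prime (u_0,u_1)})$, where $(\mathcal{P}_2^{\prime (u_0,u_1)})$ can be solved
straightforwardly (cf. Theorem 4.2). We therefore only focus on relaxing the subproblem
$(\mathcal{P}_2^{\prime\prime (u_0,u_1)})$:

\[
(\mathcal{P}_2^{\prime\prime (u_0,u_1)}): \qquad \min_{\substack{\hat{\mathbf{r}}_1 \in \mathbb{R} \\ \hat{\mathbf{x}}_1 \in \mathcal{X}}} \ \hat{\mathbf{r}}_1
\]

subject to

\[ \left| \hat{\mathbf{r}}_1 - r^{(u_1),k_1} \right|^2 \le L_\rho^2 \left\| \hat{\mathbf{x}}_1 - x^{(u_1),k_1} \right\|^2, \qquad \forall k_1 \in \{1, \ldots, n^{(u_1)}\}, \tag{5.1} \]

\[ \left\| \hat{\mathbf{x}}_1 - y^{(u_0),k_0} \right\|^2 \le L_f^2 \left\| x_0 - x^{(u_0),k_0} \right\|^2, \qquad \forall k_0 \in \{1, \ldots, n^{(u_0)}\}. \tag{5.2} \]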



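5.1. The Trust-region Subproblem Relaxation Scheme. An easy way to obtain a
relaxation from an optimization problem is to drop some constraints. We therefore suggest to
drop all constraints (5.1) but one, indexed by $k_1$. Similarly we drop all constraints (5.2) but
one, indexed by $k_0$. The following problem is therefore a relaxation of $(\mathcal{P}_2^{\prime\prime (u_0,u_1)})$:

\[
(\mathcal{P}_{TR}^{\prime\prime (u_0,u_1)}(k_0, k_1)): \qquad \min_{\substack{\hat{\mathbf{r}}_1 \in \mathbb{R} \\ \hat{\mathbf{x}}_1 \in \mathcal{X}}} \ \hat{\mathbf{r}}_1
\]

subject to

\[ \left| \hat{\mathbf{r}}_1 - r^{(u_1),k_1} \right|^2 \le L_\rho^2 \left\| \hat{\mathbf{x}}_1 - x^{(u_1),k_1} \right\|^2, \tag{5.3} \]

\[ \left\| \hat{\mathbf{x}}_1 - y^{(u_0),k_0} \right\|^2 \le L_f^2 \left\| x_0 - x^{(u_0),k_0} \right\|^2. \tag{5.4} \]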



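We then have the following theorem:

Theorem 5.1. Let us denote by $B_{TR}^{\prime\prime (u_0,u_1),k_0,k_1}(\mathcal{F})$ the bound given by the resolution
of $(\mathcal{P}_{TR}^{\prime\prime (u_0,u_1)}(k_0, k_1))$. We have:

\[ B_{TR}^{\prime\prime (u_0,u_1),k_0,k_1}(\mathcal{F}) = r^{(u_1),k_1} - L_\rho \left\| \hat{\mathbf{x}}_1^*(k_0, k_1) - x^{(u_1),k_1} \right\|, \]

where

\[ \hat{\mathbf{x}}_1^*(k_0, k_1) = y^{(u_0),k_0} + L_f \frac{\left\| x_0 - x^{(u_0),k_0} \right\|}{\left\| y^{(u_0),k_0} - x^{(u_1),k_1} \right\|} \left( y^{(u_0),k_0} - x^{(u_1),k_1} \right) \quad \text{if } y^{(u_0),k_0} \neq x^{(u_1),k_1} \]

and, if $y^{(u_0),k_0} = x^{(u_1),k_1}$, $\hat{\mathbf{x}}_1^*(k_0, k_1)$ can be any point of the sphere centered in $y^{(u_0),k_0} = x^{(u_1),k_1}$
with radius $L_f \| x_0 - x^{(u_0),k_0} \|$.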
Proof. Observe that it consists in the minimization of $\hat{\mathbf{r}}_1$ under one interval constraint for
$\hat{\mathbf{r}}_1$ where the size of the interval is determined through the constraint (5.4). The problem is
therefore equivalent to finding the largest right-hand side of (5.3) under constraint (5.4). An
equivalent problem is therefore

\[ \max_{\hat{\mathbf{x}}_1 \in \mathcal{X}} \ \left\| \hat{\mathbf{x}}_1 - x^{(u_1),k_1} \right\|^2 \]

subject to

\[ \left\| \hat{\mathbf{x}}_1 - y^{(u_0),k_0} \right\|^2 \le L_f^2 \left\| x_0 - x^{(u_0),k_0} \right\|^2. \]

This is the maximization of a quadratic function under a norm constraint. This problem is
referred to in the literature as the trust-region subproblem [13]. In our case, the optimal value
for $\hat{\mathbf{x}}_1$, denoted by $\hat{\mathbf{x}}_1^*(k_0, k_1)$, lies on the same line as $x^{(u_1),k_1}$ and $y^{(u_0),k_0}$, with $y^{(u_0),k_0}$
lying in between $x^{(u_1),k_1}$ and $\hat{\mathbf{x}}_1^*(k_0, k_1)$, the distance between $y^{(u_0),k_0}$ and $\hat{\mathbf{x}}_1^*(k_0, k_1)$ being
exactly equal to $L_f$ times the distance between $x_0$ and $x^{(u_0),k_0}$. An illustration is given in Figure 5.1.
□
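The geometric construction of Theorem 5.1 can be sketched in a few lines. This is an illustration under stated assumptions: the function names and the numerical values are hypothetical, states are plain coordinate tuples, and in the degenerate case $y^{(u_0),k_0} = x^{(u_1),k_1}$ the code arbitrarily picks the point of the sphere along the first axis.

```python
import math


def trust_region_point(x0, x_k0, y_k0, x_k1, L_f):
    """Point of the ball centered in y_k0 that is farthest from x_k1."""
    radius = L_f * math.dist(x0, x_k0)
    gap = math.dist(y_k0, x_k1)
    if gap == 0.0:
        # Degenerate case: any point of the sphere is optimal.
        direction = tuple(1.0 if j == 0 else 0.0 for j in range(len(y_k0)))
    else:
        direction = tuple((yj - xj) / gap for yj, xj in zip(y_k0, x_k1))
    return tuple(yj + radius * dj for yj, dj in zip(y_k0, direction))


def trust_region_bound(x0, x_k0, y_k0, x_k1, r_k1, L_f, L_rho):
    """Value of the relaxation obtained with the pair of constraints (k0, k1)."""
    x1_star = trust_region_point(x0, x_k0, y_k0, x_k1, L_f)
    return r_k1 - L_rho * math.dist(x1_star, x_k1)


# Hypothetical data: one transition for u0 and one for u1.
point = trust_region_point((0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (0.0, 0.0), 1.0)
bound = trust_region_bound(
    (0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (0.0, 0.0), 5.0, L_f=1.0, L_rho=1.0
)
```

In this example the ball has center $(2, 0)$ and radius 1, so the farthest point from the origin is $(3, 0)$ and the bound equals $5 - 3 = 2$.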

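Solving $(\mathcal{P}_{TR}^{\prime\prime (u_0,u_1)}(k_0, k_1))$ provides us with a family of relaxations for our initial
problem by considering any combination $(k_0, k_1)$ of two non-relaxed constraints. Taking the
maximum out of these lower bounds yields the best possible bound out of this family of
relaxations. Finally, if we denote by $B_{TR}^{(u_0,u_1)}(\mathcal{F})$ the bound made of the sum of the solution
of $(\mathcal{P}_2^{\prime (u_0,u_1)})$ and the maximal trust-region relaxation of the problem $(\mathcal{P}_2^{\prime\prime (u_0,u_1)})$ over all
possible couples of constraints, we have:

Definition 5.2 (Trust-region Bound $B_{TR}^{(u_0,u_1)}(\mathcal{F})$).

\[ \forall (u_0, u_1) \in \mathcal{U}^2, \qquad B_{TR}^{(u_0,u_1)}(\mathcal{F}) = \hat{\mathbf{r}}_0^* + \max_{\substack{k_1 \in \{1, \ldots, n^{(u_1)}\} \\ k_0 \in \{1, \ldots, n^{(u_0)}\}}} B_{TR}^{\prime\prime (u_0,u_1),k_0,k_1}(\mathcal{F}). \]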
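Definition 5.2 combines the closed form of Corollary 4.3 with a maximum over all pairs $(k_0, k_1)$. The sketch below is a hedged illustration: the function name, the tuple layout `(x, r, y)` and the data are assumptions. It uses the fact, which follows from Theorem 5.1, that the farthest point of the ball from $x^{(u_1),k_1}$ lies at distance $\| y^{(u_0),k_0} - x^{(u_1),k_1} \| + L_f \| x_0 - x^{(u_0),k_0} \|$.

```python
import math


def trust_region_total_bound(x0, F_u0, F_u1, L_f, L_rho):
    """Trust-region bound of Definition 5.2 for a two-stage sequence (u0, u1).

    F_u0 and F_u1 are lists of transitions (x, r, y) sampled with u0 and u1.
    """
    # First stage: closed form of Corollary 4.3.
    r0_star = max(r - L_rho * math.dist(x0, x) for (x, r, _y) in F_u0)
    # Second stage: best relaxation over all pairs of kept constraints.
    best = -math.inf
    for (x_a, _r_a, y_a) in F_u0:
        radius = L_f * math.dist(x0, x_a)
        for (x_b, r_b, _y_b) in F_u1:
            farthest = math.dist(y_a, x_b) + radius
            best = max(best, r_b - L_rho * farthest)
    return r0_star + best


# Hypothetical sample with one transition per action.
F_u0 = [((1.0, 0.0), 0.0, (2.0, 0.0))]
F_u1 = [((0.0, 0.0), 5.0, (0.0, 0.0))]
total = trust_region_total_bound((0.0, 0.0), F_u0, F_u1, L_f=1.0, L_rho=1.0)
```

The double loop costs $n^{(u_0)} \times n^{(u_1)}$ distance evaluations per sequence of actions, which is the price of picking the best relaxation in the family.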
Fig. 5.1. A simple geometric algorithm to solve $(\mathcal{P}_{TR}^{\prime\prime (u_0,u_1)}(k_0, k_1))$.



Notice that in the case where $n^{(u_0)}$ and $n^{(u_1)}$ are both equal to 1, the trust-region
relaxation scheme provides an exact solution of the original optimization problem $(\mathcal{P}_2^{(u_0,u_1)})$.

Corollary 5.3. 



5.2. The Lagrangian Relaxation. Another way to obtain a lower bound on the value of 
a minimization problem is to consider a Lagrangian relaxation. In this section, we show that 
the Lagrangian relaxation of the second stage problem is a conic quadratic optimization pro- 
gram. Consider again the optimization problem (-p^'^"" "!)) if vve multiply the constraints 
( |5.1| l by dual variables /ii, . . . , fiki , ■ ■ ■ , M„("i) > and the constraints ( |5.2| i by dual variables 
Ai, . . . , Afcf, , . . . , A„(,.o) > 0, we obtain the Lagrangian dual: 



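$$
\begin{aligned}
(\mathcal{P}''^{(u_0,u_1)}_{LD}): \quad \max_{\substack{\lambda_1, \ldots, \lambda_{n^{(u_0)}} \in \mathbb{R}_+ \\ \mu_1, \ldots, \mu_{n^{(u_1)}} \in \mathbb{R}_+}} \ \min_{\substack{\hat{r}_1 \in \mathbb{R} \\ \hat{x}_1 \in \mathcal{X}}} \quad
& \hat{r}_1 + \sum_{k_1=1}^{n^{(u_1)}} \mu_{k_1} \left( \left| \hat{r}_1 - r^{(u_1),k_1} \right|^2 - L_\rho^2 \left\| \hat{x}_1 - x^{(u_1),k_1} \right\|^2 \right) \\
& + \sum_{k_0=1}^{n^{(u_0)}} \lambda_{k_0} \left( \left\| \hat{x}_1 - y^{(u_0),k_0} \right\|^2 - L_f^2 \left\| x_0 - x^{(u_0),k_0} \right\|^2 \right). \qquad (5.5)
\end{aligned}
$$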



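Observe that the optimal value of $(\mathcal{P}''^{(u_0,u_1)}_{LD})$ is known to provide a lower bound on the optimal value of $(\mathcal{P}''^{(u_0,u_1)}_2)$ [25].

Theorem 5.4. $(\mathcal{P}''^{(u_0,u_1)}_{LD})$ is a conic quadratic program.
Proof. In (5.5), we can decompose the squared norms and obtain

$$
\begin{aligned}
(\mathcal{P}''^{(u_0,u_1)}_{LD}): \quad \max_{\substack{\lambda_1, \ldots, \lambda_{n^{(u_0)}} \in \mathbb{R}_+ \\ \mu_1, \ldots, \mu_{n^{(u_1)}} \in \mathbb{R}_+}} \ \min_{\substack{\hat{r}_1 \in \mathbb{R} \\ \hat{x}_1 \in \mathcal{X}}} \quad
& \left( -L_\rho^2 \sum_{k_1=1}^{n^{(u_1)}} \mu_{k_1} + \sum_{k_0=1}^{n^{(u_0)}} \lambda_{k_0} \right) \|\hat{x}_1\|^2 + \left( \sum_{k_1=1}^{n^{(u_1)}} \mu_{k_1} \right) \hat{r}_1^2 && (5.6) \\
& + \left( 1 - 2 \sum_{k_1=1}^{n^{(u_1)}} \mu_{k_1} r^{(u_1),k_1} \right) \hat{r}_1 + \left\langle 2 L_\rho^2 \sum_{k_1=1}^{n^{(u_1)}} \mu_{k_1} x^{(u_1),k_1} - 2 \sum_{k_0=1}^{n^{(u_0)}} \lambda_{k_0} y^{(u_0),k_0}, \ \hat{x}_1 \right\rangle && (5.7) \\
& + \sum_{k_1=1}^{n^{(u_1)}} \mu_{k_1} \left( \left( r^{(u_1),k_1} \right)^2 - L_\rho^2 \left\| x^{(u_1),k_1} \right\|^2 \right) + \sum_{k_0=1}^{n^{(u_0)}} \lambda_{k_0} \left( \left\| y^{(u_0),k_0} \right\|^2 - L_f^2 \left\| x^{(u_0),k_0} - x_0 \right\|^2 \right) && (5.8)
\end{aligned}
$$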



where $\langle a, b \rangle$ denotes the inner product of $a$ and $b$. We observe that the minimization problem in $\hat{r}_1$ and $\hat{x}_1$ contains a quadratic part (5.6), a linear part (5.7) and a constant part (5.8) once we fix $\lambda_{k_0}$ and $\mu_{k_1}$. In particular, observe that the optimal solution of the minimization problem is $-\infty$ as soon as the quadratic term is negative, i.e. if:

$$\sum_{k_1=1}^{n^{(u_1)}} \mu_{k_1} < 0 \qquad (5.9)$$

or

$$-L_\rho^2 \sum_{k_1=1}^{n^{(u_1)}} \mu_{k_1} + \sum_{k_0=1}^{n^{(u_0)}} \lambda_{k_0} < 0. \qquad (5.10)$$

Since we want to find the maximum of this series of optimization problems, we are only interested in the problems for which the solution is finite. Observe that, since $\mu_{k_1} \geq 0$ for all $k_1$, the inequality (5.9) is never satisfied; the sum vanishes only if $\mu_{k_1} = 0$ for all $k_1$, in which case the objective is linear in $\hat{r}_1$ and the minimization is unbounded as well. Therefore, in the following, we will constrain $\lambda_{k_0}$ and $\mu_{k_1}$ so that the left-hand sides of (5.9) and (5.10) are strictly positive, i.e.:

$$\sum_{k_1=1}^{n^{(u_1)}} \mu_{k_1} > 0, \qquad -L_\rho^2 \sum_{k_1=1}^{n^{(u_1)}} \mu_{k_1} + \sum_{k_0=1}^{n^{(u_0)}} \lambda_{k_0} > 0.$$

Once that constraint is enforced, we observe that the minimization program is the minimization of a convex quadratic function for which the optimum can be found as a closed form formula. In order to simplify the rest of the proof, we introduce some useful notations:

Definition 5.5 (Additional Notations).

$$M \triangleq \sum_{k_1=1}^{n^{(u_1)}} \mu_{k_1}, \qquad L \triangleq \sum_{k_0=1}^{n^{(u_0)}} \lambda_{k_0},$$

$$X \triangleq \left( x^{(u_1),1} \ \ldots \ x^{(u_1),n^{(u_1)}} \right), \qquad Y \triangleq \left( y^{(u_0),1} \ \ldots \ y^{(u_0),n^{(u_0)}} \right), \qquad r \triangleq \left( r^{(u_1),1} \ \ldots \ r^{(u_1),n^{(u_1)}} \right)^T,$$

$$\lambda \triangleq \left( \lambda_1 \ \ldots \ \lambda_{n^{(u_0)}} \right)^T, \qquad \mu \triangleq \left( \mu_1 \ \ldots \ \mu_{n^{(u_1)}} \right)^T,$$

$\forall p \in \mathbb{N}_0$, $I_p$ is an identity matrix of size $p$.



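The quadratic form coming from (5.6), (5.7) and (5.8) can be written in the form

$$z^T Q z + l^T z + c$$

with

$$z = \begin{pmatrix} \hat{x}_1 \\ \hat{r}_1 \end{pmatrix} \in \mathbb{R}^{d+1}, \qquad Q = \begin{pmatrix} \left( -M L_\rho^2 + L \right) I_d & 0 \\ 0 & M \end{pmatrix}, \qquad l = \begin{pmatrix} 2 L_\rho^2 X \mu - 2 Y \lambda \\ 1 - 2 r^T \mu \end{pmatrix}$$

and the constant term $c$ is given by (5.8). The minimum of a convex quadratic form $z^T Q z + l^T z + c$ is known to take the value $-\frac{1}{4} l^T Q^{-1} l + c$. In our case, the inverse of the matrix $Q$ is trivial to compute and we obtain finally that $(\mathcal{P}''^{(u_0,u_1)}_{LD})$ can be written as

$$
\begin{aligned}
(\mathcal{P}''^{(u_0,u_1)}_{LD}): \quad \max_{\substack{\lambda \in \mathbb{R}_+^{n^{(u_0)}} \\ \mu \in \mathbb{R}_+^{n^{(u_1)}}}} \quad
& -\frac{\left\| L_\rho^2 X \mu - Y \lambda \right\|^2}{-M L_\rho^2 + L} - \frac{\left( 1 - 2 r^T \mu \right)^2}{4M} \\
& + \sum_{k_0=1}^{n^{(u_0)}} \lambda_{k_0} \left( \left\| y^{(u_0),k_0} \right\|^2 - L_f^2 \left\| x^{(u_0),k_0} - x_0 \right\|^2 \right) + \sum_{k_1=1}^{n^{(u_1)}} \mu_{k_1} \left( \left( r^{(u_1),k_1} \right)^2 - L_\rho^2 \left\| x^{(u_1),k_1} \right\|^2 \right) \qquad (5.11) \\
\text{subject to} \quad & M > 0, \\
& L > M L_\rho^2.
\end{aligned}
$$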

The optimization problem (5.11) is in variables $\lambda_1, \ldots, \lambda_{n^{(u_0)}}$ and $\mu_1, \ldots, \mu_{n^{(u_1)}}$. Observe that, with our notation, $M$ and $L$ are linear functions of the variables. The objective function contains linear terms in $\lambda_1, \ldots, \lambda_{n^{(u_0)}}$ and $\mu_1, \ldots, \mu_{n^{(u_1)}}$ as well as a fractional-quadratic function ([5]), i.e. the quotient of a concave quadratic function with a linear function. The constraint is linear. This type of problem is known as a rotated quadratic conic problem and can be formulated as a conic quadratic optimization problem ([5]) that can be
solved in polynomial time using interior point methods ||5] [32l ID . □ 
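The closed form used in the proof can be checked numerically. The following Python sketch is an illustration under stated assumptions, not the authors' code: the helper names are hypothetical, transitions are stored as `(x, r, y)` tuples, and the constraints are assumed to be the squared-norm constraints dualized in (5.5). It evaluates the Lagrangian for fixed multipliers, computes the closed-form inner minimum $-\frac{1}{4} l^T Q^{-1} l + c$ together with its minimizer $z^* = -\frac{1}{2} Q^{-1} l$, and lets one compare the two.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def sqdist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def lagrangian(r1, x1, lam, mu, x0, trans_u0, trans_u1, L_f, L_rho):
    """Inner objective of the Lagrangian dual for fixed multipliers."""
    value = r1
    for m, (x, r, _) in zip(mu, trans_u1):
        value += m * ((r1 - r) ** 2 - L_rho ** 2 * sqdist(x1, x))
    for l, (x, _, y) in zip(lam, trans_u0):
        value += l * (sqdist(x1, y) - L_f ** 2 * sqdist(x0, x))
    return value


def dual_objective(lam, mu, x0, trans_u0, trans_u1, L_f, L_rho):
    """Closed-form inner minimum; returns (value, r1_min, x1_min)."""
    M, L = sum(mu), sum(lam)
    denom = L - M * L_rho ** 2
    if M <= 0 or denom <= 0:
        raise ValueError("multipliers violate M > 0 or L > M * L_rho^2")
    # v = L_rho^2 X mu - Y lam (half of the linear coefficient of x1).
    v = [L_rho ** 2 * sum(m * t[0][i] for m, t in zip(mu, trans_u1))
         - sum(l * t[2][i] for l, t in zip(lam, trans_u0))
         for i in range(len(x0))]
    # s = 1 - 2 r^T mu (linear coefficient of r1).
    s = 1.0 - 2.0 * sum(m * t[1] for m, t in zip(mu, trans_u1))
    const = (sum(l * (dot(y, y) - L_f ** 2 * sqdist(x, x0))
                 for l, (x, _, y) in zip(lam, trans_u0))
             + sum(m * (r * r - L_rho ** 2 * dot(x, x))
                   for m, (x, r, _) in zip(mu, trans_u1)))
    value = -dot(v, v) / denom - s * s / (4.0 * M) + const
    return value, -s / (2.0 * M), [-vi / denom for vi in v]


# Toy data (hypothetical): one transition per action, unit Lipschitz constants.
x0 = [0.0, 0.0]
trans_u0 = [([1.0, 0.0], 1.0, [2.0, 0.0])]
trans_u1 = [([0.0, 0.0], 5.0, [0.0, 0.0])]
lam, mu = [0.5], [1.0 / 6.0]
value, r1_min, x1_min = dual_objective(lam, mu, x0, trans_u0, trans_u1, 1.0, 1.0)
at_min = lagrangian(r1_min, x1_min, lam, mu, x0, trans_u0, trans_u1, 1.0, 1.0)
nearby = lagrangian(r1_min + 0.3, [x1_min[0] + 0.2, x1_min[1] - 0.1],
                    lam, mu, x0, trans_u0, trans_u1, 1.0, 1.0)
```

Any feasible pair of multipliers yields a lower bound on the second-stage problem; maximizing over them is the job of a conic solver, which this sketch deliberately leaves out.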
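From there, we have the following corollary:
Corollary 5.6. $\forall (u_0, u_1) \in \mathcal{U}^2$,

$$
\begin{aligned}
B''^{(u_0,u_1)}_{LD}(\mathcal{F}) \triangleq \max_{\substack{\lambda \in \mathbb{R}_+^{n^{(u_0)}} \\ \mu \in \mathbb{R}_+^{n^{(u_1)}}}} \quad
& -\frac{\left\| L_\rho^2 X \mu - Y \lambda \right\|^2}{-M L_\rho^2 + L} - \frac{\left( 1 - 2 r^T \mu \right)^2}{4M} && (5.12) \\
& + \sum_{k_0=1}^{n^{(u_0)}} \lambda_{k_0} \left( \left\| y^{(u_0),k_0} \right\|^2 - L_f^2 \left\| x^{(u_0),k_0} - x_0 \right\|^2 \right) + \sum_{k_1=1}^{n^{(u_1)}} \mu_{k_1} \left( \left( r^{(u_1),k_1} \right)^2 - L_\rho^2 \left\| x^{(u_1),k_1} \right\|^2 \right) && (5.13) \\
\text{subject to} \quad & M > 0, \\
& L > M L_\rho^2.
\end{aligned}
$$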

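In the following, we denote by $B_{LD}^{(u_0,u_1)}(\mathcal{F})$ the lower bound made of the sum of the solution of $(\mathcal{P}'^{(u_0,u_1)}_2)$ and the relaxation of $(\mathcal{P}''^{(u_0,u_1)}_2)$ computed from the Lagrangian relaxation:

Definition 5.7 (Lagrangian Relaxation Bound $B_{LD}^{(u_0,u_1)}(\mathcal{F})$).

$$B_{LD}^{(u_0,u_1)}(\mathcal{F}) = \hat{r}_0^* + B''^{(u_0,u_1)}_{LD}(\mathcal{F}). \qquad (5.14)$$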



5.3. Comparing the Bounds. The CGRL algorithm proposed in [21, 22] for addressing the min max problem uses the procedure described in [20] for computing a lower bound on the return of a policy given a sample of trajectories. More specifically, for a given sequence $(u_0, u_1) \in \mathcal{U}^2$, the program $(\mathcal{P}_T(\mathcal{F}, L_f, L_\rho, x_0, u_0, \ldots, u_{T-1}))$ is replaced by a lower bound $B_{CGRL}^{(u_0,u_1)}(\mathcal{F})$. We may now wonder how this bound compares in the two-stage case with the two new bounds of $(\mathcal{P}_2^{(u_0,u_1)})$ that we have proposed: the trust-region bound and the Lagrangian relaxation bound.

5.3.1. Trust-region Versus CGRL. We first recall the definition of the CGRL bound in 
the two-stage case. 

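Definition 5.8 (CGRL Bound $B_{CGRL}^{(u_0,u_1)}(\mathcal{F})$). $\forall (u_0, u_1) \in \mathcal{U}^2$,

$$B_{CGRL}^{(u_0,u_1)}(\mathcal{F}) = \max_{\substack{k_1 \in \{1,\ldots,n^{(u_1)}\} \\ k_0 \in \{1,\ldots,n^{(u_0)}\}}} r^{(u_0),k_0} - L_\rho (1 + L_f) \left\| x^{(u_0),k_0} - x_0 \right\| + r^{(u_1),k_1} - L_\rho \left\| y^{(u_0),k_0} - x^{(u_1),k_1} \right\|.$$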



The following theorem shows that the Trust-region bound is always greater than or equal to 
the CGRL bound. 
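Theorem 5.9. $\forall (u_0, u_1) \in \mathcal{U}^2$,

$$B_{CGRL}^{(u_0,u_1)}(\mathcal{F}) \leq B_{TR}^{(u_0,u_1)}(\mathcal{F}).$$

Proof. Let $k_0^* \in \{1, \ldots, n^{(u_0)}\}$ and $k_1^* \in \{1, \ldots, n^{(u_1)}\}$ be such that

$$B_{CGRL}^{(u_0,u_1)}(\mathcal{F}) = r^{(u_0),k_0^*} - L_\rho (1 + L_f) \left\| x^{(u_0),k_0^*} - x_0 \right\| + r^{(u_1),k_1^*} - L_\rho \left\| y^{(u_0),k_0^*} - x^{(u_1),k_1^*} \right\|.$$

Now, let us consider the solution $B''^{(u_0,u_1),k_0^*,k_1^*}_{TR}(\mathcal{F})$ of the problem $(\mathcal{P}''^{(u_0,u_1)}_{TR}(k_0^*, k_1^*))$, and let us denote by $B^{(u_0,u_1),k_0^*,k_1^*}(\mathcal{F})$ the bound obtained if, in the definition of the value of $\hat{r}_0^*$ given in Corollary 4.3, we fix the value of $k_0$ to $k_0^*$ instead of maximizing over all possible $k_0$:

$$B^{(u_0,u_1),k_0^*,k_1^*}(\mathcal{F}) = r^{(u_0),k_0^*} - L_\rho \left\| x_0 - x^{(u_0),k_0^*} \right\| + B''^{(u_0,u_1),k_0^*,k_1^*}_{TR}(\mathcal{F}).$$

Since $r^{(u_0),k_0^*} - L_\rho \| x_0 - x^{(u_0),k_0^*} \|$ is smaller than or equal to the solution $\hat{r}_0^*$ of $(\mathcal{P}'^{(u_0,u_1)}_2)$, and since the Trust-region bound maximizes over all couples $(k_0, k_1)$, one has:

$$B_{TR}^{(u_0,u_1)}(\mathcal{F}) \geq B^{(u_0,u_1),k_0^*,k_1^*}(\mathcal{F}). \qquad (5.15)$$

Now, observe that:

$$B^{(u_0,u_1),k_0^*,k_1^*}(\mathcal{F}) - B_{CGRL}^{(u_0,u_1)}(\mathcal{F}) = L_\rho L_f \left\| x^{(u_0),k_0^*} - x_0 \right\| + L_\rho \left\| y^{(u_0),k_0^*} - x^{(u_1),k_1^*} \right\| - L_\rho \left\| \hat{x}_1^*(k_0^*, k_1^*) - x^{(u_1),k_1^*} \right\|. \qquad (5.16)$$

By construction, $\hat{x}_1^*(k_0^*, k_1^*)$ lies on the same line as $y^{(u_0),k_0^*}$ and $x^{(u_1),k_1^*}$ (see Figure 5.1). Furthermore

$$\left\| \hat{x}_1^*(k_0^*, k_1^*) - x^{(u_1),k_1^*} \right\| = \left\| \hat{x}_1^*(k_0^*, k_1^*) - y^{(u_0),k_0^*} \right\| + \left\| y^{(u_0),k_0^*} - x^{(u_1),k_1^*} \right\|. \qquad (5.17)$$

Using (5.17) in (5.16) yields

$$B^{(u_0,u_1),k_0^*,k_1^*}(\mathcal{F}) - B_{CGRL}^{(u_0,u_1)}(\mathcal{F}) = L_\rho L_f \left\| x^{(u_0),k_0^*} - x_0 \right\| - L_\rho \left\| \hat{x}_1^*(k_0^*, k_1^*) - y^{(u_0),k_0^*} \right\|. \qquad (5.18)$$

By construction, Equation (5.18) is equal to 0 (see Figure 5.1), which proves the equality of the two bounds:

$$B^{(u_0,u_1),k_0^*,k_1^*}(\mathcal{F}) = B_{CGRL}^{(u_0,u_1)}(\mathcal{F}). \qquad (5.19)$$

The final result is given by combining Equations (5.15) and (5.19). □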
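The ordering between the two bounds can be observed on a small hand-made sample. The sketch below is illustrative only: the function names and the `(x, r, y)` data layout are hypothetical, the CGRL expression is the two-stage form $r^{(u_0),k_0} - L_\rho (1 + L_f) \|x^{(u_0),k_0} - x_0\| + r^{(u_1),k_1} - L_\rho \|y^{(u_0),k_0} - x^{(u_1),k_1}\|$ maximized over $(k_0, k_1)$, and the Trust-region bound uses the closed form of Appendix A together with an assumed closed form for the first-stage term.

```python
import math


def cgrl_bound(x0, trans_u0, trans_u1, L_f, L_rho):
    """Two-stage CGRL bound: one joint maximization over (k0, k1)."""
    return max(
        r0 - L_rho * (1.0 + L_f) * math.dist(xk0, x0)
        + r1 - L_rho * math.dist(y0, x1)
        for xk0, r0, y0 in trans_u0
        for x1, r1, _ in trans_u1
    )


def trust_region_bound(x0, trans_u0, trans_u1, L_f, L_rho):
    """Trust-region bound: the first-stage term is maximized separately."""
    r0_star = max(r - L_rho * math.dist(x, x0) for x, r, _ in trans_u0)
    second = max(
        r1 - L_rho * math.dist(y0, x1) - L_rho * L_f * math.dist(x0, xk0)
        for xk0, _, y0 in trans_u0
        for x1, r1, _ in trans_u1
    )
    return r0_star + second


# Toy data (hypothetical): two transitions for each action.
x0 = [0.0, 0.0]
trans_u0 = [([1.0, 0.0], 1.0, [2.0, 0.0]), ([0.0, 0.5], 0.2, [0.0, 1.0])]
trans_u1 = [([0.0, 0.0], 5.0, [0.0, 0.0]), ([2.0, 0.0], 1.0, [0.0, 0.0])]
b_cgrl = cgrl_bound(x0, trans_u0, trans_u1, L_f=1.0, L_rho=1.0)
b_tr = trust_region_bound(x0, trans_u0, trans_u1, L_f=1.0, L_rho=1.0)
```

Here the CGRL maximizer uses the second transition of `trans_u0`, whereas the first-stage term of the Trust-region bound is maximized by the first one, so the two bounds differ (3.2 versus 3.5). The gap comes entirely from the first-stage term.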

From the proof, one can observe that the gap between the CGRL bound and the Trust-region bound is only due to the resolution of $(\mathcal{P}'^{(u_0,u_1)}_2)$. Note that in the case where $k_0^*$ also belongs to the set $\arg\max_{k_0 \in \{1,\ldots,n^{(u_0)}\}} r^{(u_0),k_0} - L_\rho \| x^{(u_0),k_0} - x_0 \|$, then the bounds are equal.

The two corollaries follow:

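Corollary 5.10. Let $(u_0, u_1) \in \mathcal{U}^2$. Let $k_0^* \in \{1, \ldots, n^{(u_0)}\}$ and $k_1^* \in \{1, \ldots, n^{(u_1)}\}$ be such that:

$$B_{CGRL}^{(u_0,u_1)}(\mathcal{F}) = r^{(u_0),k_0^*} - L_\rho (1 + L_f) \left\| x^{(u_0),k_0^*} - x_0 \right\| + r^{(u_1),k_1^*} - L_\rho \left\| y^{(u_0),k_0^*} - x^{(u_1),k_1^*} \right\|.$$

Then,

$$\left( k_0^* \in \arg\max_{k_0 \in \{1,\ldots,n^{(u_0)}\}} r^{(u_0),k_0} - L_\rho \left\| x^{(u_0),k_0} - x_0 \right\| \right) \implies B_{CGRL}^{(u_0,u_1)}(\mathcal{F}) = B_{TR}^{(u_0,u_1)}(\mathcal{F}).$$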

Corollary 5.11. 



5.3.2. Lagrangian Relaxation Versus Trust-region. In this section, we prove that the 
lower bound obtained with the Lagrangian relaxation is always greater than or equal to the 
Trust-region bound. To prove this result, we give a preliminary lemma: 

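Lemma 5.12. Let $(u_0, u_1) \in \mathcal{U}^2$ and $(k_0, k_1) \in \{1, \ldots, n^{(u_0)}\} \times \{1, \ldots, n^{(u_1)}\}$.

Consider again the problem $(\mathcal{P}''^{(u_0,u_1)}_{TR}(k_0, k_1))$ where all constraints are dropped except the two defined by $(k_0, k_1)$:

$$
\begin{aligned}
(\mathcal{P}''^{(u_0,u_1)}_{TR}(k_0, k_1)): \quad \min_{\substack{\hat{r}_1 \in \mathbb{R} \\ \hat{x}_1 \in \mathcal{X}}} \quad & \hat{r}_1 \\
\text{subject to} \quad & \left| \hat{r}_1 - r^{(u_1),k_1} \right|^2 \leq L_\rho^2 \left\| \hat{x}_1 - x^{(u_1),k_1} \right\|^2, \\
& \left\| \hat{x}_1 - y^{(u_0),k_0} \right\|^2 \leq L_f^2 \left\| x_0 - x^{(u_0),k_0} \right\|^2.
\end{aligned}
$$

Then, the Lagrangian relaxation of $(\mathcal{P}''^{(u_0,u_1)}_{TR}(k_0, k_1))$ leads to a bound denoted by $B''^{(u_0,u_1),k_0,k_1}_{LD}(\mathcal{F})$ which is equal to the Trust-region bound $B''^{(u_0,u_1),k_0,k_1}_{TR}(\mathcal{F})$, i.e.

$$B''^{(u_0,u_1),k_0,k_1}_{LD}(\mathcal{F}) = B''^{(u_0,u_1),k_0,k_1}_{TR}(\mathcal{F}).$$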

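Proofs of this lemma can be found in [13] and O, but we also provide in Appendix A a proof in our particular case. We then have the following theorem:
Theorem 5.13. $\forall (u_0, u_1) \in \mathcal{U}^2$,

$$B_{TR}^{(u_0,u_1)}(\mathcal{F}) \leq B_{LD}^{(u_0,u_1)}(\mathcal{F}).$$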



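Proof. Let $(u_0, u_1) \in \mathcal{U}^2$. Let $(k_0^*, k_1^*) \in \{1, \ldots, n^{(u_0)}\} \times \{1, \ldots, n^{(u_1)}\}$ be such that:

$$B_{TR}^{(u_0,u_1)}(\mathcal{F}) = \hat{r}_0^* + B''^{(u_0,u_1),k_0^*,k_1^*}_{TR}(\mathcal{F}).$$

Considering $(k_0, k_1) = (k_0^*, k_1^*)$ in Lemma 5.12, we have:

$$B_{TR}^{(u_0,u_1)}(\mathcal{F}) = \hat{r}_0^* + B''^{(u_0,u_1),k_0^*,k_1^*}_{LD}(\mathcal{F}). \qquad (5.20)$$

Then, one can observe that the Lagrangian relaxation of the problem $(\mathcal{P}''^{(u_0,u_1)}_{TR}(k_0^*, k_1^*))$, from which $B''^{(u_0,u_1),k_0^*,k_1^*}_{LD}(\mathcal{F})$ is computed, is also a relaxation of the problem $(\mathcal{P}''^{(u_0,u_1)}_2)$ for which all the dual variables corresponding to constraints that are not related with the system transitions $(x^{(u_0),k_0^*}, r^{(u_0),k_0^*}, y^{(u_0),k_0^*})$ and $(x^{(u_1),k_1^*}, r^{(u_1),k_1^*}, y^{(u_1),k_1^*})$ would be forced to zero, i.e.

$$
\begin{aligned}
B''^{(u_0,u_1),k_0^*,k_1^*}_{LD}(\mathcal{F}) = \max_{\substack{\lambda \in \mathbb{R}_+^{n^{(u_0)}} \\ \mu \in \mathbb{R}_+^{n^{(u_1)}}}} \quad
& -\frac{\left\| L_\rho^2 X \mu - Y \lambda \right\|^2}{-M L_\rho^2 + L} - \frac{\left( 1 - 2 r^T \mu \right)^2}{4M} \\
& + \sum_{k_0=1}^{n^{(u_0)}} \lambda_{k_0} \left( \left\| y^{(u_0),k_0} \right\|^2 - L_f^2 \left\| x^{(u_0),k_0} - x_0 \right\|^2 \right) + \sum_{k_1=1}^{n^{(u_1)}} \mu_{k_1} \left( \left( r^{(u_1),k_1} \right)^2 - L_\rho^2 \left\| x^{(u_1),k_1} \right\|^2 \right) \\
\text{subject to} \quad & M > 0, \\
& L > M L_\rho^2, \\
& \lambda_{k_0} = 0 \ \text{if} \ k_0 \neq k_0^*, \quad \forall k_0 \in \{1, \ldots, n^{(u_0)}\}, \\
& \mu_{k_1} = 0 \ \text{if} \ k_1 \neq k_1^*, \quad \forall k_1 \in \{1, \ldots, n^{(u_1)}\}.
\end{aligned}
$$

We therefore have:

$$B''^{(u_0,u_1),k_0^*,k_1^*}_{LD}(\mathcal{F}) \leq B''^{(u_0,u_1)}_{LD}(\mathcal{F}). \qquad (5.21)$$

By definition of the Lagrangian relaxation bound $B_{LD}^{(u_0,u_1)}(\mathcal{F})$, we have

$$B_{LD}^{(u_0,u_1)}(\mathcal{F}) = \hat{r}_0^* + B''^{(u_0,u_1)}_{LD}(\mathcal{F}). \qquad (5.22)$$

Equations (5.20), (5.21) and (5.22) finally give:

$$B_{TR}^{(u_0,u_1)}(\mathcal{F}) \leq B_{LD}^{(u_0,u_1)}(\mathcal{F}).$$

□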



5.3.3. Bounds Inequalities: Summary. We summarize in the following theorem all the 
results that were obtained in the previous sections. 
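Theorem 5.14. $\forall (u_0, u_1) \in \mathcal{U}^2$,

$$B_{CGRL}^{(u_0,u_1)}(\mathcal{F}) \leq B_{TR}^{(u_0,u_1)}(\mathcal{F}) \leq B_{LD}^{(u_0,u_1)}(\mathcal{F}) \leq B_2^{(u_0,u_1)}(\mathcal{F}) \leq J_2^{(u_0,u_1)}.$$

Proof. Let $(u_0, u_1) \in \mathcal{U}^2$. The inequality

$$B_{CGRL}^{(u_0,u_1)}(\mathcal{F}) \leq B_{TR}^{(u_0,u_1)}(\mathcal{F}) \leq B_{LD}^{(u_0,u_1)}(\mathcal{F})$$

is a straightforward consequence of Theorems 5.9 and 5.13. The inequality

$$B_{LD}^{(u_0,u_1)}(\mathcal{F}) \leq B_2^{(u_0,u_1)}(\mathcal{F}) \qquad (5.23)$$

is a property of the Lagrangian relaxation, and the inequality

$$B_2^{(u_0,u_1)}(\mathcal{F}) \leq J_2^{(u_0,u_1)} \qquad (5.24)$$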



is derived from the formalization of the min max generalization problem introduced in 
D 
5.4. Convergence Properties. We finally propose to analyze the convergence of the bounds, as well as the sequences of actions that lead to the maximization of the bounds, when the sample dispersion decreases towards zero. We assume in this section that the state space $\mathcal{X}$ is bounded:

$$\exists C_{\mathcal{X}} > 0 : \forall (x, x') \in \mathcal{X}^2, \quad \|x - x'\| \leq C_{\mathcal{X}}.$$

Let us now introduce the sample dispersion:

Definition 5.15 (Sample Dispersion). Since $\mathcal{X}$ is bounded, one has:

$$\exists \alpha > 0 : \forall u \in \mathcal{U}, \quad \sup_{x \in \mathcal{X}} \ \min_{k \in \{1,\ldots,n^{(u)}\}} \left\| x^{(u),k} - x \right\| \leq \alpha. \qquad (5.25)$$

The smallest $\alpha$ which satisfies equation (5.25) is named the sample dispersion and is denoted by $\alpha^*(\mathcal{F})$. Intuitively, the sample dispersion $\alpha^*(\mathcal{F})$ can be seen as the radius of the largest non-visited state space area.
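To make the definition concrete, the sample dispersion can be approximated numerically by replacing the supremum over the state space with a maximum over a finite grid. The sketch below is only an approximation under that assumption (a grid can miss the true worst-case state, so it never overestimates the dispersion); the function names and the dictionary layout are hypothetical.

```python
import itertools
import math


def unit_grid(d, n):
    """Regular grid with n points per axis over [0, 1]^d."""
    axis = [i / (n - 1) for i in range(n)]
    return [list(p) for p in itertools.product(axis, repeat=d)]


def sample_dispersion(states_by_action, grid_points):
    """Grid approximation of the smallest alpha satisfying (5.25).

    For each action, take the largest distance from a grid point to its
    nearest sampled state, then keep the worst case over all actions.
    """
    return max(
        max(min(math.dist(x, xk) for xk in states) for x in grid_points)
        for states in states_by_action.values()
    )


# Toy data (hypothetical): sampled states for two actions in [0, 1]^2.
states_by_action = {
    0.0: [[0.0, 0.0], [1.0, 1.0]],
    0.1: [[0.5, 0.5]],
}
alpha = sample_dispersion(states_by_action, unit_grid(d=2, n=11))
```

In this example the worst case is reached for the first action at the corners (0, 1) and (1, 0), both at distance 1 from the nearest sampled state.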

5.4.1. Bounds. We analyze in this subsection the tightness of the Trust-region and the 
Lagrangian relaxation lower bounds as a function of the sample dispersion. 
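Lemma 5.16.

$$\exists C > 0 : \forall (u_0, u_1) \in \mathcal{U}^2, \ \forall B^{(u_0,u_1)}(\mathcal{F}) \in \left\{ B_{CGRL}^{(u_0,u_1)}(\mathcal{F}), B_{TR}^{(u_0,u_1)}(\mathcal{F}), B_{LD}^{(u_0,u_1)}(\mathcal{F}) \right\},$$

$$J_2^{(u_0,u_1)} - B^{(u_0,u_1)}(\mathcal{F}) \leq C \alpha^*(\mathcal{F}).$$

Proof. The proof for the case where $B^{(u_0,u_1)}(\mathcal{F}) = B_{CGRL}^{(u_0,u_1)}(\mathcal{F})$ is given in [21]. According to Theorem 5.14, one has

$$B_{CGRL}^{(u_0,u_1)}(\mathcal{F}) \leq B_{TR}^{(u_0,u_1)}(\mathcal{F}) \leq B_{LD}^{(u_0,u_1)}(\mathcal{F}) \leq J_2^{(u_0,u_1)},$$

which ends the proof. □

We therefore have the following theorem:
Theorem 5.17.

$$\lim_{\alpha^*(\mathcal{F}) \to 0} J_2^{(u_0,u_1)} - B^{(u_0,u_1)}(\mathcal{F}) = 0.$$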



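5.4.2. Bound-optimal Sequences of Actions. In the following, we denote by $B^{(*)}_{CGRL}(\mathcal{F})$ (resp. $B^{(*)}_{TR}(\mathcal{F})$ and $B^{(*)}_{LD}(\mathcal{F})$) the maximal CGRL bound (resp. the maximal Trust-region bound and maximal Lagrangian relaxation bound) over the set of all possible sequences of actions, i.e.,

Definition 5.18 (Maximal Bounds).

$$B^{(*)}_{CGRL}(\mathcal{F}) \triangleq \max_{(u_0,u_1) \in \mathcal{U}^2} B_{CGRL}^{(u_0,u_1)}(\mathcal{F}),$$

$$B^{(*)}_{TR}(\mathcal{F}) \triangleq \max_{(u_0,u_1) \in \mathcal{U}^2} B_{TR}^{(u_0,u_1)}(\mathcal{F}),$$

$$B^{(*)}_{LD}(\mathcal{F}) \triangleq \max_{(u_0,u_1) \in \mathcal{U}^2} B_{LD}^{(u_0,u_1)}(\mathcal{F}).$$

We also denote by $(u_0, u_1)^{CGRL}_{\mathcal{F}}$ (resp. $(u_0, u_1)^{TR}_{\mathcal{F}}$ and $(u_0, u_1)^{LD}_{\mathcal{F}}$) three sequences of actions that maximize the bounds:

Definition 5.19 (Bound-optimal Sequences of Actions).

$$(u_0, u_1)^{CGRL}_{\mathcal{F}} \in \arg\max_{(u_0,u_1) \in \mathcal{U}^2} B_{CGRL}^{(u_0,u_1)}(\mathcal{F}),$$

$$(u_0, u_1)^{TR}_{\mathcal{F}} \in \arg\max_{(u_0,u_1) \in \mathcal{U}^2} B_{TR}^{(u_0,u_1)}(\mathcal{F}),$$

$$(u_0, u_1)^{LD}_{\mathcal{F}} \in \arg\max_{(u_0,u_1) \in \mathcal{U}^2} B_{LD}^{(u_0,u_1)}(\mathcal{F}).$$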
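Selecting a bound-optimal sequence amounts to enumerating the finite set of two-stage action sequences and keeping one with the largest bound. The following Python sketch illustrates this with the Trust-region bound; it is a minimal example, not the authors' implementation. The names, the dictionary of `(x, r, y)` transitions keyed by action, and the closed forms used inside `trust_region_bound` (the pairwise closed form from Appendix A and an assumed closed form for the first-stage term) are assumptions of the sketch.

```python
import itertools
import math


def trust_region_bound(x0, trans_u0, trans_u1, L_f, L_rho):
    r0_star = max(r - L_rho * math.dist(x, x0) for x, r, _ in trans_u0)
    second = max(
        r1 - L_rho * math.dist(y0, x1) - L_rho * L_f * math.dist(x0, xk0)
        for xk0, _, y0 in trans_u0
        for x1, r1, _ in trans_u1
    )
    return r0_star + second


def bound_optimal_sequence(x0, transitions, L_f, L_rho, bound=trust_region_bound):
    """Return ((u0, u1), value) maximizing the given bound over all sequences."""
    scores = {
        (u0, u1): bound(x0, transitions[u0], transitions[u1], L_f, L_rho)
        for u0, u1 in itertools.product(sorted(transitions), repeat=2)
    }
    best = max(scores, key=scores.get)
    return best, scores[best]


# Toy one-dimensional sample (hypothetical), consistent with x' = x + u, reward = x.
transitions = {
    0: [([0.0], 0.0, [0.0])],
    1: [([0.0], 0.0, [1.0]), ([1.0], 1.0, [2.0])],
}
best_sequence, best_value = bound_optimal_sequence([0.0], transitions, 1.0, 1.0)
```

Passing another function through the `bound` argument (for instance a CGRL or Lagrangian bound with the same signature) yields the corresponding bound-optimal sequence.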

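We finally give in this section a last theorem that shows the convergence of the sequences of actions $(u_0, u_1)^{CGRL}_{\mathcal{F}}$, $(u_0, u_1)^{TR}_{\mathcal{F}}$ and $(u_0, u_1)^{LD}_{\mathcal{F}}$ towards optimal sequences of actions, i.e. sequences of actions that lead to an optimal return $J_2^*$, when the sample dispersion $\alpha^*(\mathcal{F})$ decreases towards zero.

Theorem 5.20. Let $\mathfrak{J}_2^*$ be the set of optimal two-stage sequences of actions:

$$\mathfrak{J}_2^* \triangleq \left\{ (u_0, u_1) \in \mathcal{U}^2 \ \middle| \ J_2^{(u_0,u_1)} = J_2^* \right\},$$

and let us suppose that $\mathfrak{J}_2^* \neq \mathcal{U}^2$ (if $\mathfrak{J}_2^* = \mathcal{U}^2$, the search for an optimal sequence of actions is indeed trivial). We define

$$\epsilon \triangleq \min_{(u_0,u_1) \in \mathcal{U}^2 \setminus \mathfrak{J}_2^*} \left\{ J_2^* - J_2^{(u_0,u_1)} \right\}.$$

Then, $\forall (u_0, u_1)_{\mathcal{F}} \in \left\{ (u_0, u_1)^{CGRL}_{\mathcal{F}}, (u_0, u_1)^{TR}_{\mathcal{F}}, (u_0, u_1)^{LD}_{\mathcal{F}} \right\}$,

$$\left( C \alpha^*(\mathcal{F}) < \epsilon \right) \implies (u_0, u_1)_{\mathcal{F}} \in \mathfrak{J}_2^*. \qquad (5.26)$$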

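Proof. Let us prove the theorem by contradiction. Let us assume that $C \alpha^*(\mathcal{F}) < \epsilon$. Let $B^{(u_0,u_1)}(\mathcal{F}) \in \left\{ B_{CGRL}^{(u_0,u_1)}(\mathcal{F}), B_{TR}^{(u_0,u_1)}(\mathcal{F}), B_{LD}^{(u_0,u_1)}(\mathcal{F}) \right\}$, and let $(u_0, u_1)_{\mathcal{F}}$ be a sequence such that

$$(u_0, u_1)_{\mathcal{F}} \in \arg\max_{(u_0,u_1) \in \mathcal{U}^2} B^{(u_0,u_1)}(\mathcal{F})$$

and let us assume that $(u_0, u_1)_{\mathcal{F}}$ is not optimal. This implies that

$$J_2^{(u_0,u_1)_{\mathcal{F}}} \leq J_2^* - \epsilon.$$

Now, let us consider a sequence $(u_0^*, u_1^*) \in \mathfrak{J}_2^*$. Then

$$J_2^{(u_0^*,u_1^*)} = J_2^*.$$

The lower bound $B^{(u_0^*,u_1^*)}(\mathcal{F})$ satisfies the relationship

$$J_2^* - B^{(u_0^*,u_1^*)}(\mathcal{F}) \leq C \alpha^*(\mathcal{F}).$$

Knowing that $C \alpha^*(\mathcal{F}) < \epsilon$, we have

$$B^{(u_0^*,u_1^*)}(\mathcal{F}) > J_2^* - \epsilon.$$

Since

$$J_2^{(u_0,u_1)_{\mathcal{F}}} \geq B^{(u_0,u_1)_{\mathcal{F}}}(\mathcal{F}),$$

we have

$$B^{(u_0^*,u_1^*)}(\mathcal{F}) > B^{(u_0,u_1)_{\mathcal{F}}}(\mathcal{F}),$$

which contradicts the fact that $(u_0, u_1)_{\mathcal{F}}$ belongs to the set $\arg\max_{(u_0,u_1) \in \mathcal{U}^2} B^{(u_0,u_1)}(\mathcal{F})$. This ends the proof. □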

5.4.3. Remark. It is important to notice that the tightness of the bounds resulting from 
the relaxation schemes proposed in this paper does not depend explicitly on the sample dis- 
persion (which suffers from the curse of dimensionality), but depends rather on the initial 
state for which the sequence of actions is computed and on the local concentration of samples 
around the actual (unknown) trajectories of the system. Therefore, this may lead to cases 
where the bounds are tight for some specific initial states, even if the sample does not cover 
every area of the state space well enough. 

6. Experimental Results. We provide some experimental results to illustrate the theoretical properties of the CGRL, Trust-region and Lagrangian relaxation bounds given above. We compare the tightness of the bounds, as well as the performance of the bound-optimal sequences of actions, on an academic benchmark.

6.1. Benchmark. We consider a linear benchmark whose dynamics is defined as follows:

$$\forall (x, u) \in \mathcal{X} \times \mathcal{U}, \quad f(x, u) = x + 3.1416 \times u \times 1_d,$$

where $1_d \in \mathbb{R}^d$ denotes a $d$-dimensional vector for which each component is equal to 1. The reward function is defined as follows:

$$\forall (x, u) \in \mathcal{X} \times \mathcal{U}, \quad \rho(x, u) = \sum_{i=1}^{d} x(i),$$

where $x(i)$ denotes the $i$-th component of $x$. The state space $\mathcal{X}$ is included in $\mathbb{R}^d$ and the finite action space is equal to $\mathcal{U} = \{0, 0.1\}$. The system dynamics $f$ is 1-Lipschitz continuous and the reward function is $\sqrt{d}$-Lipschitz continuous. The initial state of the system is set to

$$x_0 = 0.5772 \times 1_d.$$

The dimension $d$ of the state space is set to $d = 2$. In all our experiments, the computation of the Lagrangian relaxations, which requires solving a conic-quadratic program, is done using SeDuMi [47].
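For readers who want to reproduce the setting, the benchmark itself fits in a few lines of Python. This is a sketch under one explicit assumption: the two-stage return of a sequence $(u_0, u_1)$ is taken to be $\rho(x_0, u_0) + \rho(f(x_0, u_0), u_1)$. The constant and function names are hypothetical.

```python
import math

D = 2                      # dimension of the state space
ACTIONS = (0.0, 0.1)       # finite action space
X0 = [0.5772] * D          # initial state
L_F = 1.0                  # Lipschitz constant of the dynamics
L_RHO = math.sqrt(D)       # Lipschitz constant of the reward function


def f(x, u):
    """Dynamics: shift every component by 3.1416 * u."""
    return [xi + 3.1416 * u for xi in x]


def rho(x, u):
    """Reward: sum of the state components (independent of the action)."""
    return sum(x)


def two_stage_return(x0, u0, u1):
    """Assumed two-stage return: reward at x0 plus reward at f(x0, u0)."""
    return rho(x0, u0) + rho(f(x0, u0), u1)


returns = {
    (u0, u1): two_stage_return(X0, u0, u1)
    for u0 in ACTIONS
    for u1 in ACTIONS
}
best_return = max(returns.values())
```

Because the reward does not depend on the action, the return only depends on $u_0$ under this assumption: choosing $u_0 = 0.1$ gives about 2.937, against about 2.309 for $u_0 = 0$.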

6.2. Protocol and Results. 

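6.2.1. Typical Run. For different cardinalities $c_i = 2i^2$, $i = 1, \ldots, 15$, we generate a sample of transitions $\mathcal{F}_{c_i}$ using a grid over $[0, 1]^d \times \mathcal{U}$, as follows: $\forall u \in \mathcal{U}$,

$$\mathcal{F}^{(u)}_{c_i} = \left\{ \left( \left[ \frac{i_1}{i} ; \frac{i_2}{i} \right], u, \rho\left( \left[ \frac{i_1}{i} ; \frac{i_2}{i} \right], u \right), f\left( \left[ \frac{i_1}{i} ; \frac{i_2}{i} \right], u \right) \right) \ \middle| \ (i_1, i_2) \in \{1, \ldots, i\}^2 \right\}$$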
Fig. 6.1. Bounds $B^{(*)}_{CGRL}(\mathcal{F}_{c_i})$, $B^{(*)}_{TR}(\mathcal{F}_{c_i})$ and $B^{(*)}_{LD}(\mathcal{F}_{c_i})$ computed from all samples of transitions $\mathcal{F}_{c_i}$, $i \in \{1, \ldots, 15\}$, of cardinality $c_i = 2i^2$ (horizontal axis: cardinality of the samples of transitions).

Fig. 6.2. Returns of the sequences $(u_0, u_1)^{CGRL}_{\mathcal{F}_{c_i}}$, $(u_0, u_1)^{TR}_{\mathcal{F}_{c_i}}$ and $(u_0, u_1)^{LD}_{\mathcal{F}_{c_i}}$ computed from all samples of transitions $\mathcal{F}_{c_i}$, $i \in \{1, \ldots, 15\}$, of cardinality $c_i = 2i^2$ (horizontal axis: cardinality of the samples of transitions).
and 
We report in Figure 6.1 the values of the maximal CGRL bound $B^{(*)}_{CGRL}(\mathcal{F}_{c_i})$, the maximal Trust-region bound $B^{(*)}_{TR}(\mathcal{F}_{c_i})$ and the maximal Lagrangian relaxation bound $B^{(*)}_{LD}(\mathcal{F}_{c_i})$ as a function of the cardinality $c_i$ of the samples of transitions $\mathcal{F}_{c_i}$. We also report in Figure 6.2 the returns $J_2^{(u_0,u_1)^{CGRL}_{\mathcal{F}_{c_i}}}$, $J_2^{(u_0,u_1)^{TR}_{\mathcal{F}_{c_i}}}$ and $J_2^{(u_0,u_1)^{LD}_{\mathcal{F}_{c_i}}}$ of the bound-optimal sequences of actions $(u_0, u_1)^{CGRL}_{\mathcal{F}_{c_i}}$, $(u_0, u_1)^{TR}_{\mathcal{F}_{c_i}}$ and $(u_0, u_1)^{LD}_{\mathcal{F}_{c_i}}$.

As expected, we observe that the bound computed with the Lagrangian relaxation is always greater than or equal to the Trust-region bound, which is also greater than or equal to the CGRL bound, as predicted by Theorem 5.14. On the other hand, no difference was observed in terms of return of the bound-optimal sequences of actions.




Fig. 6.3. Average values $A_{CGRL}(c_i)$, $A_{TR}(c_i)$ and $A_{LD}(c_i)$ of the bounds computed from all samples of transitions $\mathcal{F}_{c_i,k}$, $k \in \{1, \ldots, 100\}$, of cardinality $c_i = 2i^2$.



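6.2.2. Uniformly Drawn Samples of Transitions. In order to observe the influence of the dispersion of the state-action points of the transitions on the quality of the bounds, we propose the following protocol. For each cardinality $c_i = 2i^2$, $i = 1, \ldots, 15$, we generate 100 samples of transitions $\mathcal{F}_{c_i,1}, \ldots, \mathcal{F}_{c_i,100}$ using a uniform probability distribution over the space $[0, 1]^d \times \mathcal{U}$. For each sample of transitions $\mathcal{F}_{c_i,k}$, $i \in \{1, \ldots, 15\}$, $k \in \{1, \ldots, 100\}$, we compute the maximal CGRL bound $B^{(*)}_{CGRL}(\mathcal{F}_{c_i,k})$, the maximal Trust-region bound $B^{(*)}_{TR}(\mathcal{F}_{c_i,k})$ and the maximal Lagrangian relaxation bound $B^{(*)}_{LD}(\mathcal{F}_{c_i,k})$. We then compute the average values of the maximal CGRL, Trust-region and Lagrangian relaxation bounds:

$$
\forall i \in \{1, \ldots, 15\}, \quad
\begin{aligned}
A_{CGRL}(c_i) &= \frac{1}{100} \sum_{k=1}^{100} B^{(*)}_{CGRL}(\mathcal{F}_{c_i,k}), \\
A_{TR}(c_i) &= \frac{1}{100} \sum_{k=1}^{100} B^{(*)}_{TR}(\mathcal{F}_{c_i,k}), \\
A_{LD}(c_i) &= \frac{1}{100} \sum_{k=1}^{100} B^{(*)}_{LD}(\mathcal{F}_{c_i,k}),
\end{aligned}
$$

and we report in Figure 6.3 the values $A_{CGRL}(c_i)$ (resp. $A_{TR}(c_i)$ and $A_{LD}(c_i)$) as a function of the cardinality $c_i$ of the samples of transitions. We also report in Figure 6.4 the average returns of the bound-optimal sequences of actions $(u_0, u_1)^{CGRL}_{\mathcal{F}_{c_i,k}}$, $(u_0, u_1)^{TR}_{\mathcal{F}_{c_i,k}}$ and $(u_0, u_1)^{LD}_{\mathcal{F}_{c_i,k}}$:

$$
\forall i \in \{1, \ldots, 15\}, \quad
\begin{aligned}
J_{CGRL}(c_i) &= \frac{1}{100} \sum_{k=1}^{100} J_2^{(u_0,u_1)^{CGRL}_{\mathcal{F}_{c_i,k}}}, \\
J_{TR}(c_i) &= \frac{1}{100} \sum_{k=1}^{100} J_2^{(u_0,u_1)^{TR}_{\mathcal{F}_{c_i,k}}}, \\
J_{LD}(c_i) &= \frac{1}{100} \sum_{k=1}^{100} J_2^{(u_0,u_1)^{LD}_{\mathcal{F}_{c_i,k}}},
\end{aligned}
$$

as a function of the cardinality $c_i$ of the samples of transitions.

Fig. 6.4. Average values $J_{CGRL}(c_i)$, $J_{TR}(c_i)$ and $J_{LD}(c_i)$ of the return of the bound-optimal sequences of actions computed from all samples of transitions $\mathcal{F}_{c_i,k}$, $k \in \{1, \ldots, 100\}$, of cardinality $c_i = 2i^2$.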
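The averaging protocol can be sketched as follows for the CGRL and Trust-region bounds (the Lagrangian bound is left out because it requires a conic solver). This is an illustrative sketch, not the code used for the reported experiments. All names are hypothetical, the closed forms of the two bounds are assumptions carried over from Section 5 and Appendix A, and, as a simplification of the uniform draw over $[0, 1]^d \times \mathcal{U}$, each sample contains exactly $i^2$ transitions per action so that no action is left without data.

```python
import math
import random

D = 2
ACTIONS = (0.0, 0.1)
X0 = [0.5772] * D
L_F, L_RHO = 1.0, math.sqrt(D)


def f(x, u):
    return [xi + 3.1416 * u for xi in x]


def rho(x, u):
    return sum(x)


def draw_sample(n_per_action, rng):
    """Uniform states in [0, 1]^D; n_per_action transitions for each action."""
    sample = {}
    for u in ACTIONS:
        states = [[rng.random() for _ in range(D)] for _ in range(n_per_action)]
        sample[u] = [(x, rho(x, u), f(x, u)) for x in states]
    return sample


def cgrl_bound(x0, t0, t1):
    return max(
        r0 - L_RHO * (1.0 + L_F) * math.dist(xk0, x0)
        + r1 - L_RHO * math.dist(y0, x1)
        for xk0, r0, y0 in t0 for x1, r1, _ in t1
    )


def trust_region_bound(x0, t0, t1):
    r0_star = max(r - L_RHO * math.dist(x, x0) for x, r, _ in t0)
    second = max(
        r1 - L_RHO * math.dist(y0, x1) - L_RHO * L_F * math.dist(x0, xk0)
        for xk0, _, y0 in t0 for x1, r1, _ in t1
    )
    return r0_star + second


def maximal_bound(bound, sample):
    """Maximum of the bound over all two-stage action sequences."""
    return max(bound(X0, sample[u0], sample[u1])
               for u0 in ACTIONS for u1 in ACTIONS)


def average_maximal_bounds(i, n_runs, seed=0):
    """Average maximal CGRL and Trust-region bounds over n_runs samples."""
    rng = random.Random(seed)
    totals = [0.0, 0.0]
    for _ in range(n_runs):
        sample = draw_sample(i * i, rng)
        totals[0] += maximal_bound(cgrl_bound, sample)
        totals[1] += maximal_bound(trust_region_bound, sample)
    return totals[0] / n_runs, totals[1] / n_runs


avg_cgrl, avg_tr = average_maximal_bounds(i=3, n_runs=20)
```

Whatever the random draw, the average CGRL value cannot exceed the average Trust-region value, and both stay below the best achievable two-stage return of the benchmark, since each bound is a lower bound on the return of its action sequence.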
We observe that, on average, the Lagrangian relaxation bound is much tighter than the Trust-region and the CGRL bounds. The CGRL bound and the Trust-region bound remain very close on average, which illustrates, in a sense, Corollary 5.10. Moreover, we also observe



that the bound-optimal sequences of actions (uq, ^ better perform on average 

7. Conclusions. We have considered in this paper the problem of computing min max 
policies for deterministic, Lipschitz continuous batch mode reinforcement learning. First, 
we have shown that this min max problem is NP-hard. Afterwards, we have proposed for the 
two-stage case two relaxation schemes. Both have been extensively studied and, in particular, 
they have been shown to perform better than the CGRL algorithm that has been introduced 
earlier to address this min-max generalization problem. 

A natural extension of this work would be to investigate how the proposed relaxation schemes could be extended to the $T$-stage ($T \geq 3$) framework. Lipschitz continuity assumptions are common in a batch mode reinforcement learning setting, but one could imagine developing min max strategies in other types of environments that are not necessarily Lipschitzian, or even not continuous. Additionally, it would also be interesting to extend the resolution schemes proposed in this paper to problems with very large or continuous action spaces.
Appendix A. Proof of Lemma 5.12.

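Proof. For conciseness, we denote $(x^{(u_0),k_0}, r^{(u_0),k_0}, y^{(u_0),k_0})$ (resp. $(x^{(u_1),k_1}, r^{(u_1),k_1}, y^{(u_1),k_1})$) by $(x^0, r^0, y^0)$ (resp. $(x^1, r^1, y^1)$), and $\lambda_1$ (resp. $\mu_1$) by $\lambda$ (resp. $\mu$). We assume that $x_0 \neq x^0$ and $x^1 \neq y^0$, otherwise the problem is trivial.
• Trust-region solution.

According to Definition 5.2, we have:

$$B''^{(u_0,u_1),k_0,k_1}_{TR}(\mathcal{F}) = r^1 - L_\rho \left\| \hat{x}_1^*(k_0, k_1) - x^1 \right\|$$

where

$$\hat{x}_1^*(k_0, k_1) = y^0 + L_f \frac{\left\| x_0 - x^0 \right\|}{\left\| y^0 - x^1 \right\|} \left( y^0 - x^1 \right),$$

which writes

$$B''^{(u_0,u_1),k_0,k_1}_{TR}(\mathcal{F}) = r^1 - L_\rho \left\| y^0 - x^1 \right\| - L_\rho L_f \left\| x_0 - x^0 \right\|.$$

• Lagrangian relaxation based solution.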

• Lagrangian relaxation based solution. 
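According to Equation (5.11), we can write:

$$
\begin{aligned}
B''^{(u_0,u_1),k_0,k_1}_{LD}(\mathcal{F}) = \max_{\lambda \in \mathbb{R}_+, \ \mu \in \mathbb{R}_+} \quad
& -\frac{\left\| L_\rho^2 \mu x^1 - \lambda y^0 \right\|^2}{-\mu L_\rho^2 + \lambda} - \frac{\left( 1 - 2 r^1 \mu \right)^2}{4 \mu}
+ \lambda \left( \left\| y^0 \right\|^2 - L_f^2 \left\| x^0 - x_0 \right\|^2 \right) + \mu \left( \left( r^1 \right)^2 - L_\rho^2 \left\| x^1 \right\|^2 \right) \\
\text{subject to} \quad & \mu > 0, \\
& -\mu L_\rho^2 + \lambda > 0.
\end{aligned}
$$

We denote by $\mathcal{L}(\lambda, \mu)$ the quantity:

$$\mathcal{L}(\lambda, \mu) = -\frac{\left\| L_\rho^2 \mu x^1 - \lambda y^0 \right\|^2}{-\mu L_\rho^2 + \lambda} - \frac{\left( 1 - 2 r^1 \mu \right)^2}{4 \mu} + \lambda \left( \left\| y^0 \right\|^2 - L_f^2 \left\| x^0 - x_0 \right\|^2 \right) + \mu \left( \left( r^1 \right)^2 - L_\rho^2 \left\| x^1 \right\|^2 \right).$$

Let $\lambda$ and $\mu$ be such that $\mu > 0$ and $\lambda > \mu L_\rho^2$. Since the Trust-region solution to $(\mathcal{P}''^{(u_0,u_1)}_{TR}(k_0, k_1))$ is optimal, and by property of the Lagrangian relaxation [25], one has:

$$\mathcal{L}(\lambda, \mu) \leq B''^{(u_0,u_1),k_0,k_1}_{TR}(\mathcal{F}). \qquad (A.1)$$

In order to prove the lemma, it is therefore sufficient to determine two values $\lambda_0$ and $\mu_0$ such that the inequality (A.1) is an equality. By differentiating $\mathcal{L}(\lambda, \mu)$, we obtain, after a long calculation (that we omit here):

$$
\left\{
\begin{aligned}
\frac{\partial \mathcal{L}(\lambda, \mu)}{\partial \lambda} &= 0 \\
\frac{\partial \mathcal{L}(\lambda, \mu)}{\partial \mu} &= 0
\end{aligned}
\right.
\implies
\left\{
\begin{aligned}
\lambda &= \frac{L_\rho}{2 L_f \left\| x^0 - x_0 \right\|}, \\
\frac{1}{\mu} &= 2 L_\rho \left( \left\| y^0 - x^1 \right\| + L_f \left\| x^0 - x_0 \right\| \right).
\end{aligned}
\right.
$$

We denote by $\lambda_0$ and $\mu_0$ the following values for the dual variables:

$$\lambda_0 \triangleq \frac{L_\rho}{2 L_f \left\| x^0 - x_0 \right\|}, \qquad \mu_0 \triangleq \frac{1}{2 L_\rho \left( \left\| y^0 - x^1 \right\| + L_f \left\| x^0 - x_0 \right\| \right)}.$$

We have:

$$\lambda_0 = \mu_0 L_\rho^2 \left( 1 + \frac{\left\| y^0 - x^1 \right\|}{L_f \left\| x^0 - x_0 \right\|} \right).$$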
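In the following, we denote $\left( 1 + \frac{\| y^0 - x^1 \|}{L_f \| x^0 - x_0 \|} \right)$ by $K$. We now give the expression of $\mathcal{L}(\lambda_0, \mu_0)$ using only $\mu_0$ and $K$:

$$
\begin{aligned}
\mathcal{L}(\lambda_0, \mu_0) &= -\frac{L_\rho^2 \mu_0 \left\| x^1 - K y^0 \right\|^2}{K - 1} - \frac{1}{4 \mu_0} + r^1 - \left( r^1 \right)^2 \mu_0 \\
&\quad + \mu_0 K L_\rho^2 \left( \left\| y^0 \right\|^2 - L_f^2 \left\| x^0 - x_0 \right\|^2 \right) + \mu_0 \left( \left( r^1 \right)^2 - L_\rho^2 \left\| x^1 \right\|^2 \right) \\
&= -\frac{L_\rho^2 \mu_0 \left\| x^1 - K y^0 \right\|^2}{K - 1} - \frac{1}{4 \mu_0} + r^1 \\
&\quad + L_\rho^2 \mu_0 K \left( \left\| y^0 \right\|^2 - L_f^2 \left\| x^0 - x_0 \right\|^2 \right) - L_\rho^2 \mu_0 \left\| x^1 \right\|^2.
\end{aligned}
$$

Using the fact that $x^1 - K y^0 = x^1 - y^0 - (K - 1) y^0$, we can write:

$$
\begin{aligned}
\mathcal{L}(\lambda_0, \mu_0) &= -\frac{L_\rho^2 \mu_0}{K - 1} \left( \left\| x^1 - y^0 \right\|^2 + (K - 1)^2 \left\| y^0 \right\|^2 - 2 (K - 1) \left( x^1 - y^0 \right)^T y^0 \right) \\
&\quad - \frac{1}{4 \mu_0} + r^1 + L_\rho^2 \mu_0 K \left( \left\| y^0 \right\|^2 - L_f^2 \left\| x^0 - x_0 \right\|^2 \right) - L_\rho^2 \mu_0 \left\| x^1 \right\|^2
\end{aligned}
$$

and

$$
\begin{aligned}
\mathcal{L}(\lambda_0, \mu_0) &= -\frac{L_\rho^2 \mu_0}{K - 1} \left\| x^1 - y^0 \right\|^2 - L_\rho^2 \mu_0 (K - 1) \left\| y^0 \right\|^2 + 2 L_\rho^2 \mu_0 \left( x^1 - y^0 \right)^T y^0 \\
&\quad - \frac{1}{4 \mu_0} + r^1 + L_\rho^2 \mu_0 K \left( \left\| y^0 \right\|^2 - L_f^2 \left\| x^0 - x_0 \right\|^2 \right) - L_\rho^2 \mu_0 \left\| x^1 \right\|^2.
\end{aligned}
$$

From there,

$$
\begin{aligned}
\mathcal{L}(\lambda_0, \mu_0) &= -\frac{L_\rho^2 \mu_0}{K - 1} \left\| x^1 - y^0 \right\|^2 + \left\| y^0 \right\|^2 \left( -L_\rho^2 \mu_0 (K - 1) - 2 L_\rho^2 \mu_0 + L_\rho^2 \mu_0 K \right) \\
&\quad + 2 L_\rho^2 \mu_0 \left( x^1 \right)^T y^0 - \frac{1}{4 \mu_0} + r^1 + L_\rho^2 \mu_0 K \left( -L_f^2 \left\| x^0 - x_0 \right\|^2 \right) - L_\rho^2 \mu_0 \left\| x^1 \right\|^2
\end{aligned}
$$

and, since $\left( -L_\rho^2 \mu_0 (K - 1) - 2 L_\rho^2 \mu_0 + L_\rho^2 \mu_0 K \right) = -L_\rho^2 \mu_0$, we have that

$$
\begin{aligned}
\mathcal{L}(\lambda_0, \mu_0) &= -\frac{L_\rho^2 \mu_0}{K - 1} \left\| x^1 - y^0 \right\|^2 - L_\rho^2 \mu_0 \left\| y^0 \right\|^2 + 2 L_\rho^2 \mu_0 \left( x^1 \right)^T y^0 \\
&\quad - \frac{1}{4 \mu_0} + r^1 + L_\rho^2 \mu_0 K \left( -L_f^2 \left\| x^0 - x_0 \right\|^2 \right) - L_\rho^2 \mu_0 \left\| x^1 \right\|^2
\end{aligned}
$$

and

$$
\begin{aligned}
\mathcal{L}(\lambda_0, \mu_0) &= -\frac{L_\rho^2 \mu_0}{K - 1} \left\| x^1 - y^0 \right\|^2 - L_\rho^2 \mu_0 \left( \left\| y^0 \right\|^2 + \left\| x^1 \right\|^2 - 2 \left( x^1 \right)^T y^0 \right) \\
&\quad - \frac{1}{4 \mu_0} + r^1 + L_\rho^2 \mu_0 K \left( -L_f^2 \left\| x^0 - x_0 \right\|^2 \right).
\end{aligned}
$$

Since $\left( \| y^0 \|^2 + \| x^1 \|^2 - 2 (x^1)^T y^0 \right) = \| x^1 - y^0 \|^2$, we have:

$$
\begin{aligned}
\mathcal{L}(\lambda_0, \mu_0) &= -\frac{L_\rho^2 \mu_0}{K - 1} \left\| x^1 - y^0 \right\|^2 - L_\rho^2 \mu_0 \left\| x^1 - y^0 \right\|^2 - \frac{1}{4 \mu_0} + r^1 + L_\rho^2 \mu_0 K \left( -L_f^2 \left\| x^0 - x_0 \right\|^2 \right) \\
&= -L_\rho^2 \mu_0 \left\| x^1 - y^0 \right\|^2 \left( \frac{K}{K - 1} \right) - \frac{1}{4 \mu_0} + r^1 + L_\rho^2 \mu_0 K \left( -L_f^2 \left\| x^0 - x_0 \right\|^2 \right).
\end{aligned}
$$

Since $K = 1 + \frac{\| y^0 - x^1 \|}{L_f \| x^0 - x_0 \|}$, we have:

$$
\begin{aligned}
\mathcal{L}(\lambda_0, \mu_0) &= -\left( 1 + \frac{\left\| y^0 - x^1 \right\|}{L_f \left\| x^0 - x_0 \right\|} \right) L_\rho^2 \mu_0 L_f^2 \left\| x^0 - x_0 \right\|^2 \\
&\quad - L_\rho^2 \mu_0 \frac{\left( L_f \left\| x^0 - x_0 \right\| + \left\| y^0 - x^1 \right\| \right)}{L_f \left\| x^0 - x_0 \right\|} \cdot \frac{L_f \left\| x^0 - x_0 \right\|}{\left\| x^1 - y^0 \right\|} \left\| x^1 - y^0 \right\|^2 - \frac{1}{4 \mu_0} + r^1 \\
&= -L_\rho^2 \mu_0 \left( L_f \left\| x^0 - x_0 \right\| + \left\| y^0 - x^1 \right\| \right) \left( \left\| x^1 - y^0 \right\| + L_f \left\| x^0 - x_0 \right\| \right) - \frac{1}{4 \mu_0} + r^1.
\end{aligned}
$$

Since $\mu_0 = \frac{1}{2 L_\rho \left( \| y^0 - x^1 \| + L_f \| x^0 - x_0 \| \right)}$, we finally obtain:

$$
\begin{aligned}
\mathcal{L}(\lambda_0, \mu_0) &= -\frac{L_\rho}{2} \left( \left\| y^0 - x^1 \right\| + L_f \left\| x^0 - x_0 \right\| \right) - \frac{L_\rho}{2} \left( \left\| y^0 - x^1 \right\| + L_f \left\| x^0 - x_0 \right\| \right) + r^1 \\
&= r^1 - L_\rho \left\| y^0 - x^1 \right\| - L_\rho L_f \left\| x^0 - x_0 \right\| \\
&= B''^{(u_0,u_1),k_0,k_1}_{TR}(\mathcal{F}),
\end{aligned}
$$

which ends the proof. □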
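The equality established in this appendix can be checked numerically on arbitrary data. The sketch below is an illustration with hypothetical names and hypothetical input values. It evaluates the single-pair dual objective at $\lambda_0 = K L_\rho^2 \mu_0$ and $\mu_0 = 1 / \left( 2 L_\rho ( \| y^0 - x^1 \| + L_f \| x^0 - x_0 \| ) \right)$, with $K = 1 + \| y^0 - x^1 \| / ( L_f \| x^0 - x_0 \| )$, and compares it with the closed-form Trust-region value. It assumes $x^0 \neq x_0$ and $y^0 \neq x^1$, as in the proof.

```python
import math


def lemma_5_12_values(x0, x_0k, y_0k, x_1k, r_1k, L_f, L_rho):
    """Return (dual value at (lambda_0, mu_0), Trust-region closed form)."""
    a = math.dist(x_0k, x0)    # ||x^0 - x_0||, assumed > 0
    b = math.dist(y_0k, x_1k)  # ||y^0 - x^1||, assumed > 0
    mu0 = 1.0 / (2.0 * L_rho * (b + L_f * a))
    K = 1.0 + b / (L_f * a)
    lam0 = K * L_rho ** 2 * mu0
    # Vector L_rho^2 * mu0 * x^1 - lam0 * y^0 appearing in the first fraction.
    v = [L_rho ** 2 * mu0 * x - lam0 * y for x, y in zip(x_1k, y_0k)]
    dual = (
        -sum(c * c for c in v) / (lam0 - mu0 * L_rho ** 2)
        - (1.0 - 2.0 * mu0 * r_1k) ** 2 / (4.0 * mu0)
        + lam0 * (sum(c * c for c in y_0k) - L_f ** 2 * a ** 2)
        + mu0 * (r_1k ** 2 - L_rho ** 2 * sum(c * c for c in x_1k))
    )
    tr = r_1k - L_rho * b - L_rho * L_f * a
    return dual, tr


# Arbitrary test data (hypothetical values).
dual, tr = lemma_5_12_values(
    x0=[0.3, -0.2], x_0k=[1.0, 0.5], y_0k=[2.0, -1.0],
    x_1k=[0.5, 0.7], r_1k=1.3, L_f=0.8, L_rho=1.5,
)
```

Up to floating-point rounding, `dual` and `tr` coincide, which is the content of Lemma 5.12 for a single pair of constraints.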